Systems and methods for automatically extracting data from electronic documents using external data

ABSTRACT

In a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to extract data from the electronic documents, a method of automatically extracting data from each received electronic document at least in part using data external to the electronic document but associated with the job containing the document is provided. The method includes: analyzing each electronic document in a job to automatically extract images and text features; and, if any of the images and text features extracted from the electronic document is not recognized, using data external to said document but associated with said job to identify the unrecognized feature, wherein the external source may be one of at least one other document in the job and a database having known values associated with the job.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application No. 61/295,210, filed Jan. 15, 2010, which is hereby incorporated by reference herein in its entirety.

This application is also related to the following applications filed concurrently herewith on Jan. 14, 2011:

U.S. patent application Ser. No. ______, entitled “Systems and methods for training document analysis system for automatically extracting data from documents;”

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data from electronic documents containing multiple layout features;”

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically correcting data extracted from electronic documents using known constraints for semantics of extracted data elements;”

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically reducing data search space and improving data extraction accuracy using known constraints in a layout of extracted data elements;”

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically processing electronic documents using multiple image transformation algorithms;”

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data from electronic documents using multiple character recognition engines;”

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data by narrowing data search scope using contour matching;”

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data from electronic document page including multiple copies of a form;” and

U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically grouping electronic document pages.”

FIELD OF THE INVENTION

This invention relates generally to systems and methods to extract data from electronic documents, and more particularly to systems and methods for automatically extracting data from electronic documents using external data.

BACKGROUND

Millions of documents are produced every day that are reviewed, processed, stored, audited and transformed into computer-readable data. Examples include accounts payable, collections, educational forms, financial statements, government documents, human resource records, insurance claims, legal papers, medical records, mortgages, nonprofit reports, payroll records, shipping documents and tax forms.

These documents generally require data to be extracted in order to be processed. Data extraction can be primarily clerical in nature, such as in inputting information on customer survey forms. Data extraction can also be an essential portion of larger technical tasks, such as preparing income tax returns, processing healthcare records or handling insurance claims.

Various techniques, such as Electronic Data Interchange (EDI), attempt to eliminate human processing efforts by coding and transmitting the document information in strictly formatted messages. Electronic Data Interchange is known for custom computer systems, cumbersome software and bloated standards that defeated its rapid spread throughout the supply chain. Perceiving EDI as too expensive, the vast majority of businesses have avoided implementing it. Similarly, applications of XML, XBRL and other computer-readable document files are quite limited compared to the use of documents in paper and digital image formats (such as PDF and TIFF).

Ideally, these documents would be capable of being both read by people and automatically processed by computers. Since paper and digital image files comprise an overwhelming percentage of all documents, it would be most practical to train computers to extract data from human-readable documents.

To date, there have been three general methods of performing data extraction on documents: conventional, outsourcing and automation.

Conventional data extraction, the first method, requires workers with specific education, domain expertise, particular training, software knowledge and/or cultural understanding. Data extraction workers must recognize documents, identify and extract relevant information on the documents and enter the data appropriately and accurately in particular software programs. Such manual data extraction is complex, time-consuming and error-prone. As a result, the cost of data extraction is often quite high; numerous studies estimate the cost of processing invoices in excess of ten dollars each. The cost is especially high when the data extraction is performed by accountants, lawyers, physicians and other highly paid professionals as part of their work. For example, professional tax preparers report spending hours on each client tax return transcribing salary, interest, dividend and capital gains data; they also admit to human data extraction/entry accuracies of less than 90%.

Conventional data extraction also exposes all documents in their entirety to data extraction workers. These documents may have sensitive information related to individuals' and organizations' education, employment, family, financial, health, insurance, legal, tax, and/or other matters.

Whereas conventional data extraction is entirely paper-based, outsourcing and automation begin by converting paper to digital image files. This step is straightforward, aided by high quality, fast, affordable scanners that are available from many vendors including Bell+Howell, Canon, Epson, Fujitsu, Kodak, Panasonic and Xerox.

Once paper documents are converted to digital image files, document processing can be made more productive through the use of workflow software that routes the documents to the lowest-cost labor available, either in-house or outsourced, on-shore or overseas. Primary processing can be done by junior personnel; exceptions can be handled by more highly trained people. Despite the potential productivity gains that are enabled with workflow software in the form of improved labor utilization, manual document processing remains a fundamentally expensive process.

Outsourcing, the second method of data extraction, requires the same worker education, expertise, training, software knowledge and/or cultural understanding. As with conventional data extraction, outsourced data extraction workers must recognize documents, find relevant information on the documents, extract and enter the data appropriately and accurately in particular software programs. Since outsourcing is manual, just as is conventional data extraction, it is also complex, time-consuming and error-prone. Outsourcing firms such as Accenture, Datamatics, Hewlett Packard, IBM, Infosys, Tata, and Wipro often reduce costs by offshoring data extraction work to locations with low wage data extraction workers. For example, extraction of data from US tax and financial documents is a function that has been implemented using thousands of well-educated, English-speaking workers in India and other low wage countries.

The first step of outsourcing requires organizations to scan financial, health, tax and/or other documents and save the resulting image files. These image files can be accessed by data extraction workers via several methods. One method stores the image files on the source organizations' computer systems; the data extraction workers view the image files over networks (such as the Internet or private networks). Another method stores the image files on third-party computer systems; the data extraction workers view the image files over networks. An alternative method transmits the image files from source organizations over networks and stores the image files for viewing by the data extraction workers on the data extraction organizations' computer system.

For example, an accountant may scan the various tax forms containing client financial data and transmit the scanned image files to an outsourcing firm. An employee of the outsourcing firm extracts the client financial data and enters it into an income tax software program. The resulting tax software data file is then transmitted back to the accountant.

Quality problems with offshore data extraction work have been reported by many customers. Outsourced service providers address these problems by hiring better educated and/or more experienced workers, providing them more extensive training, extracting and entering data two or more times and/or exhaustively checking their work for quality errors. These measures reduce the cost savings expected from offshore outsourcing.

Outsourcing and offshoring are accompanied by concerns over security risks associated with fraud and identity theft. These security concerns apply to employees and temporary workers as well as outsourced workers and offshore workers who have access to documents with sensitive information.

Although the transmission of scanned image files to the data extraction organization may be secured by cryptographic techniques, the sensitive data and personal identifying information are in the clear, i.e., unencrypted, when read by data extraction workers prior to entry in the appropriate computer systems. Data extraction organizations publicly recognize the need for information security. Some data extraction organizations claim to investigate and perform background checks of employees. Many data extraction organizations claim to strictly limit physical access to the rooms in which the employees enter the data; further, such rooms may be isolated. Paper, writing materials, cameras or other recording technology may be forbidden in the rooms. Additionally, employees may be subject to inspection to ensure that nothing is copied or removed. Since such seemingly comprehensive security precautions are primarily physical in nature, they are imperfect.

Because of these imperfections, lapses in physical security have occurred. For example, Social Security Numbers and bank routing numbers are only nine digits; bank account numbers are usually of similar length. Memorizing these important numbers would not be difficult and would allow a nefarious employee to have direct access to the money held in those accounts. For example, in 2004 employees of MphasiS in Pune, India allegedly stole $426,000 from Citibank customers. The owners, managers, staff, guards and contractors of data extraction organizations may misuse some or all of the unencrypted confidential information in their care. Further, breaches of physical and information system security by external parties can occur. Because data extraction organizations are increasingly located in foreign countries, there is often little or no recourse for American citizens victimized in this manner.

Information security has been identified for seven consecutive years as the most important technology initiative by the Top Technology Initiatives survey of the American Institute of Certified Public Accountants (AICPA). National and state laws have been enacted and new regulations have been implemented to address these security concerns, particularly those related to outsourced data extraction that is performed offshore.

The third general method of data extraction involves partial automation, often combining optical character recognition, human inspection and workflow management software.

Software tools that facilitate the automated extraction and transformation of document information are available from several vendors including ABBYY, AnyDoc Software, EMC Captiva, Kofax and Nuance. The relative operating cost savings facilitated by these tools is proportional to the amount of automation, which depends on the application, quality of software customization, variety and quality of documents and other factors.

Automation requires customizing and/or programming data extraction software tools to properly recognize and process a specific set of documents for a specific domain. Because such customization projects often cost upwards of hundreds of thousands of dollars, data extraction automation is usually limited to large organizations that can afford significant capital investments.

The first step of a partially automated data extraction operation is to scan financial, health, tax and/or other documents and save the resulting image files. The scanned images are compared to a database of known documents. Images that are not identified are routed to data extraction workers for conventional processing. Images that are identified have data extracted using templates, either location-based or label-based, along with optical character recognition (OCR) technology.

Optical character recognition is imperfect, often mistaking more than one percent of the characters on clean, high quality documents. Many documents are neither clean nor high quality, suffering from being folded or marred before scanning, distorted during scanning and degraded during post-scanning binarization. As a result, some of the labels needed to identify data are often not recognizable; therefore, some of the data cannot be automatically extracted.

Using conventional software tools, vendors report being able to extract up to 80-90% of the data on a limited number of typical forms. When a wide range of forms exists, such as the 10,000 plus variations of W-2, 1099, K-1 and other personal income tax forms, automated data extraction is quite limited. Despite years of efforts, several tax document automation vendors claim 50% or less data extraction and admit to numerous errors with conventional data extraction methods.
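As a rough illustration of why a character error rate near one percent matters, the sketch below estimates how often an entire multi-character field is read without error. It assumes that character errors are independent of one another, which is a simplification not stated in this document, and the function name `field_accuracy` is hypothetical.

```python
def field_accuracy(char_accuracy: float, length: int) -> float:
    """Probability that every character of a field is read correctly.

    Assumption (not from the source): per-character errors are independent.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    return char_accuracy ** length


# A nine-character field, such as a nine-digit number, read at 99%
# per-character accuracy.
nine_char_field = field_accuracy(0.99, 9)
print(f"{nine_char_field:.3f}")
```

Under these assumptions, roughly one nine-character field in eleven would contain at least one misread character, even on a clean document.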

Correcting errors entails human inspection. Inspection requires workers with the same capabilities as data extraction workers, namely specific education, domain expertise, particular training, software knowledge and/or cultural understanding. Inspection workers must recognize documents, find relevant information on the documents and ensure that the data has been accurately extracted and appropriately entered in particular software programs. Typically, any changes made by inspection workers must be reviewed and approved by other, more senior, inspection workers before replacing the data extracted by optical character recognition. Because automation requires human inspection, source documents with sensitive information are exposed in their entirety to data extraction workers.

SUMMARY OF INVENTION

The invention is directed to systems and methods for automatically extracting data from electronic documents using external data.

In a preferred embodiment, a method in a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to extract data from the electronic documents is provided. The method automatically extracts data from each received electronic document at least in part using data external to the electronic document but associated with the job containing the document. The method includes: analyzing each electronic document in a job to automatically extract images and text features; and, if any of the images and text features extracted from the electronic document is not recognized, using data external to said document but associated with said job to identify the unrecognized feature, wherein the external source may be one of at least one other document in the job and a database having known values associated with the job.
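The fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Document` and `Job` classes, the use of `None` to mark an unrecognized feature, and the choice to consult other documents in the job before the known-values store are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    name: str
    # Field label -> extracted value; None marks a feature that was
    # extracted but not recognized (a hypothetical convention).
    fields: Dict[str, Optional[str]] = field(default_factory=dict)


@dataclass
class Job:
    documents: List[Document]
    # Hypothetical stand-in for a database of known values for the job.
    known_values: Dict[str, str] = field(default_factory=dict)


def resolve_unrecognized(job: Job) -> None:
    """Fill unrecognized fields using data external to each document."""
    for doc in job.documents:
        for label, value in list(doc.fields.items()):
            if value is not None:
                continue
            # External source 1: another document in the same job.
            for other in job.documents:
                if other is not doc and other.fields.get(label) is not None:
                    doc.fields[label] = other.fields[label]
                    break
            else:
                # External source 2: known values associated with the job.
                if label in job.known_values:
                    doc.fields[label] = job.known_values[label]


job = Job(
    documents=[
        Document("form_a", {"employer_id": None, "recipient_id": None}),
        Document("form_b", {"employer_id": "12-3456789"}),
    ],
    known_values={"recipient_id": "000-00-0000"},
)
resolve_unrecognized(job)
```

After the call, both unrecognized fields on `form_a` are filled: one from `form_b` and one from the known values. A field that neither source can supply simply remains `None`.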

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is illustrated in the figures of the accompanying drawings which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:

FIG. 1 is a system diagram of a document data extraction system 100 according to a preferred embodiment of the disclosed subject matter;

FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the disclosed subject matter;

FIG. 3 is a system diagram of the web server system 120 according to a preferred embodiment of the disclosed subject matter;

FIG. 4 is a system diagram of the document processing system 130 according to a preferred embodiment of the disclosed subject matter;

FIG. 5 is a system diagram of the image processing system 422 according to a preferred embodiment of the disclosed subject matter;

FIG. 6 is a system diagram of the classification system 432 according to a preferred embodiment of the disclosed subject matter;

FIG. 7 is a system diagram of the grouping system 442 according to a preferred embodiment of the disclosed subject matter;

FIG. 8 is a system diagram of the data extraction system 452 according to a preferred embodiment of the disclosed subject matter;

FIG. 9 is an illustration of a three-step document submission process according to a preferred embodiment of the disclosed subject matter;

FIG. 10 is an illustration of the nine types of point patterns according to a preferred embodiment of the disclosed subject matter;

FIG. 11 is an illustration of image processing prior to OCR according to a preferred embodiment of the disclosed subject matter;

FIG. 12 is an illustration of a log polar histogram according to a preferred embodiment of the disclosed subject matter;

FIG. 13 is a flow diagram of the service control manager 410 according to a preferred embodiment of the disclosed subject matter;

FIG. 14 is an illustration of label contour matching according to a preferred embodiment of the disclosed subject matter;

FIG. 15 is a flow diagram of a CRK classifier according to a preferred embodiment of the disclosed subject matter;

FIG. 16 is a schematic of a CRK classifier according to a preferred embodiment of the disclosed subject matter;

FIG. 17 is an illustration of relative location matching of labels according to a preferred embodiment of the disclosed subject matter;

FIG. 18 is an exemplary computer system on which the described invention may run according to a preferred embodiment of the disclosed subject matter;

FIG. 19 is an illustration of boxes containing labels and values;

FIG. 20 is an illustration of check boxes;

FIG. 21 is an illustration of address blocks;

FIG. 22 is an illustration of an instruction block;

FIG. 23 is an illustration of a table;

FIG. 24 is an illustration of a multi-copy form;

FIG. 25 is an illustration of an image with (A) confetti, (B) confetti with identified labels and (C) confetti with identified labels and labels with potential table headers grouped horizontally;

FIG. 26 is an illustration of a table with a header that needs reconstruction;

FIG. 27 is an illustration of a table with an instruction block at the bottom;

FIG. 28 is an illustration of a portion of a table with noise removed and most data correctly extracted;

FIG. 29 is an illustration of row formation in a table;

FIG. 30 is an illustration of column formation in a table;

FIG. 31 is an illustration of header association for a table;

FIG. 32 is an illustration of a table with extracted data viewed through a debug tool; note the incorrectly formed rows due to the “Corrected” overlay. Rows 1, 2, 3, 5, 6, 7, 8, and 9 are merged, but row 4 and the rest of the table were extracted properly;

FIG. 33 is an illustration of an image being extracted via a process of progressive refinement and reduced character set OCR;

FIG. 34 is an illustration of an image being extracted via a process of progressive refinement based on increasing knowledge about the form;

FIG. 35 is an illustration of an image being extracted via a process of progressive refinement based on utilizing knowledge gained from one form to extract data from another form;

FIG. 36 is an illustration of data external to the input image that is used to extract and verify data from the input image;

FIG. 37 is an illustration of a form with an obscured label;

FIG. 38 is an illustration of the data extracted from the form shown in FIG. 37;

FIG. 39 is an illustration of a form with a degraded image that results in incorrectly extracted data;

FIG. 40 is an illustration of the data extracted from the form shown in FIG. 39;

FIG. 41 is an illustration of a portion of a W-2 form;

FIG. 42 is an illustration of the internal representation of the data corresponding to the form in FIG. 41 as a partial layout graph;

FIG. 43 is an illustration of the internal representation of the data corresponding to the form in FIG. 41 after labels are detected;

FIG. 44 is an illustration of the labels associated with a 1099-OID form;

FIG. 45 is an illustration of a table;

FIG. 46 is an illustration of the table shown in FIG. 45 with columns identified;

FIG. 47 is an illustration of the table shown in FIG. 45 with columns and labels identified;

FIG. 48 is an illustration of the table shown in FIG. 45 with columns, labels and header identified;

FIG. 49 is an illustration of the table shown in FIG. 45 with columns, labels, header and rows identified;

FIG. 50 is an illustration of four occurrences of image fields for “Wages, tips, other comp.” box on a single W-2 form;

FIG. 51 is an illustration of the data records corresponding to the image fields shown in FIG. 50.

DETAILED DESCRIPTION

While the prior art attempts to reduce the cost of data extraction through the use of low cost labor and partial automation, none of the above methods of data extraction (1) eliminates the human labor and its accompanying requirements of education, domain expertise, training, software knowledge and/or cultural understanding, (2) minimizes the time spent entering and quality checking the data, (3) minimizes errors, (4) protects the privacy of the owners of the data without being dependent on the security systems of data extraction organizations and (5) eliminates the cost for significant up-front engineering efforts. What is needed, therefore, is a method of performing data extraction that overcomes the above-mentioned limitations and that includes the features enumerated above.

Preferred embodiments of the present invention provide a method and system for extracting data from paper and digital documents into a format that is searchable, editable and manageable.

FIG. 1 is a system diagram of a document data extraction system 100 according to a preferred embodiment of the invention. System 100 has an image capture system 110, a web server system 120 and a document processing system 130. In the preferred embodiment, the image capture system 110 is connected to the web server system 120 by a network such as a local-area network (LAN), a wide-area network (WAN) or the Internet. The preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption. Encryption certificates can be purchased from well respected certificate authorities such as VeriSign and thawte or can be generated by using numerous key generation tools in the market today, many of which are available as open source. Alternatively, the files may be transferred over a non-secure network, albeit in a less secure manner. The web server system 120 is connected to the document processing system 130 via software within a computer system. Other embodiments of the invention may integrate the document processing system 130 with the image capture system 110. In this case, the web server system 120 is not necessary.

Under typical operation, System 110 is an image capture system that receives physical documents and scans them. The image capture system 110 is described in greater detail below.

Under typical operation, System 120 is a web server system that receives the scanned documents and returns the extracted data over the Internet. Some embodiments of the invention may not have a web server system 120. The web server system 120 is described in greater detail below.

Under typical operation, System 130 is a document processing system. The document processing system 130 extracts the received data into files and databases per a predetermined scheme. Under preferred embodiments, the document processing system 130 is comprised of several modules that are part of a highly distributed architecture which consists of several independent processes, data repositories and databases which communicate and pass messages to each other via well defined standard and proprietary interfaces. Even though the document processing system 130 may be built in a loosely coupled manner to achieve maximum scalability and throughput, the same results can be achieved if the document processing system 130 were more tightly coupled in a single process with each module being a logical entity of the same process. Furthermore, the document processing system 130 supports multiple different product types which may process anywhere from hundreds to millions of documents every day for tens to thousands of customers in different markets. Under preferred embodiments, the document processing system 130 utilizes server(s) hosted in a secure data center so that documents from healthcare, insurance, banking, government, tax and other applications are processed per security policies that are HIPAA, GLBA, SAS70, etc. compliant. The document processing system 130 includes mechanisms for learning documents. The document processing system 130 is described in greater detail below.
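The tightly coupled variant mentioned above, in which each module is a logical entity within one process, might be sketched as below. The stage names echo the subsystems listed for FIGS. 5 through 8, but their ordering, the dictionary message format and the helper names `make_stage` and `run_pipeline` are assumptions made only for illustration; the stages here are placeholders that record that they ran.

```python
from queue import Queue
from typing import Callable, Dict, List

Message = Dict[str, object]
Stage = Callable[[Message], Message]

# Assumed ordering of the modules; the source does not fix one here.
STAGE_NAMES = ["image_processing", "classification", "grouping", "data_extraction"]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only appends its name to a trace."""
    def stage(message: Message) -> Message:
        trace = list(message.get("trace", []))
        trace.append(name)
        return {**message, "trace": trace}
    return stage


def run_pipeline(messages: List[Message], stages: List[Stage]) -> List[Message]:
    """Pass every message through each stage in order, using queues as
    the hand-off between consecutive modules."""
    inbox: Queue = Queue()
    for message in messages:
        inbox.put(message)
    for stage in stages:
        outbox: Queue = Queue()
        while not inbox.empty():
            outbox.put(stage(inbox.get()))
        inbox = outbox
    return [inbox.get() for _ in range(inbox.qsize())]


results = run_pipeline([{"id": "page-1"}], [make_stage(n) for n in STAGE_NAMES])
```

In a loosely coupled deployment, each in-memory queue would presumably be replaced by an inter-process messaging interface, leaving the stage logic unchanged.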

FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the invention. System 110 has a scanning system 212, a user interface system 222, a data acquisition system 225, a data transfer system 232 and an image pre-processing system 235. Source documents 210 in the form of papers are physically placed on an input tray of a commercial scanner. Source documents in the form of data files are received over a network by the user interface system 222. The user interface system 222 communicates with the scanning system 212 via software within a computer system, or, optionally, over a computer network. The user interface system 222 may be part of the scanning system 212 in some embodiments of the image capture system 110. The user interface system 222 communicates with the data acquisition system 225 via software within a computer system. The user interface system 222 communicates with the data transfer system 232 via software within a computer system. The data acquisition system 225 communicates with the scanning system 212 via a physical connection, such as a high-speed Universal Serial Bus (USB) 2.0, or, optionally, over a network. The data acquisition system 225 may also be part of the scanning system 212 in certain embodiments of the image capture system 110. The data acquisition system 225 communicates with the image pre-processing system 235 via software within a computer system. The data transfer system 232 communicates with the image pre-processing system 235 via software within a computer system. The data acquisition system 225 and the data transfer system 232 may also be part of the scanning system 212 in some embodiments of the image capture system 110.

Element 210 is a source document in the form of either one or more physical sheets of paper, or a digital file containing images of one or more sheets of paper. The digital file can be in one of many formats, such as PDF, TIFF, BMP, or JPEG.

System 212 is a scanning system. Under preferred embodiments, conventional scanning systems may be used such as those from Bell+Howell, Canon, Fujitsu, Kodak, Panasonic and Xerox. These embodiments include scanners connected directly to a computer, shared scanners connected to a computer over a network, and smart scanners that include embedded computational functionality to add third-party applications. The scanning system 212 captures an image of the scanned document as a computer file; the file is often in a standard format such as PDF, TIFF, BMP, or JPEG.

System 222 is a user interface system. Under preferred embodiments, the user interface system 222 runs in a browser and presents a user with a three-step means for submitting documents to be organized as shown in FIG. 9. In step one, the user interface system 222 provides a mechanism for selecting a job from a list of jobs; additionally, it allows jobs to be added to the job list. In step two, the user interface system 222 provides a mechanism for initiating the scanning of physical papers; additionally, it provides a browsing mechanism for selecting a file on a computer or network. Optionally, one or more sets of papers can be scanned and one or more files can be selected. In step three, the user interface system 222 provides a mechanism for sending the job information and selected documents over a network to the server system. Under preferred embodiments, the user interface system 222 also presents a user with the status of jobs that have been submitted as submitted or completed; optionally, it presents the expected completion date and time of submitted jobs that have not been completed. The user interface system 222 also presents a user with a mechanism for receiving submitted documents and extracted data. The user interface system 222 also provides a mechanism for deleting files from the system. Other embodiments of the user interface system 222 may run within an application that provides the scan feature as part of a broader function, or within a simple data entry system that is composed of only a touch screen and/or one or more buttons. Furthermore, the user interface system 222 may also be embodied by a programmable API that provides the same or similar functionality to another application program.
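Since the user interface system 222 may be embodied by a programmable API, the three-step submission could look roughly like the sketch below. The `SubmissionSession` class, its method names and the payload layout are hypothetical; step three here only assembles the payload and leaves the actual network transfer out of scope.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubmissionSession:
    """Hypothetical API mirroring the three-step submission process."""
    jobs: List[str] = field(default_factory=list)
    selected_job: str = ""
    files: List[str] = field(default_factory=list)

    def select_job(self, name: str) -> None:
        # Step one: select a job, adding it to the job list if it is new.
        if name not in self.jobs:
            self.jobs.append(name)
        self.selected_job = name

    def add_file(self, path: str) -> None:
        # Step two: register a scanned or browsed file for this submission.
        self.files.append(path)

    def build_submission(self) -> Dict[str, object]:
        # Step three: assemble the job information and selected documents.
        if not self.selected_job:
            raise ValueError("no job selected")
        if not self.files:
            raise ValueError("no documents selected")
        return {"job": self.selected_job, "documents": list(self.files)}


session = SubmissionSession(jobs=["Existing job"])
session.select_job("New job")
session.add_file("scan_001.pdf")
session.add_file("scan_002.tiff")
payload = session.build_submission()
```

Raising an error when no job or no document has been selected keeps an incomplete submission from reaching the server system.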

System 225 is a data acquisition system. Under preferred embodiments, the data acquisition system 225 controls the settings of the scanning system. Many scanning systems in use today require users to manually set scanner settings so that images are captured, for example, at 300 dots per inch (dpi) as binary data (black-and-white). Commercial scanners and scanning software modify the original source document images, which often include high resolution and, possibly, color or gray-scale elements. The resolution is often reduced to limit file size. Color and gray-scale elements are often binarized, i.e., converted to black or white pixels, via a process known as thresholding, also to reduce file size. Under preferred embodiments, the data acquisition system sets the scan parameters of the scanning system. The data acquisition system commands the scanning system to begin operation and receives the scanned document computer file from the scanning operation. The data acquisition system 225 could be part of the scanning system 212, in certain embodiments. Moreover, the operation of the data acquisition system 225 could be automatically triggered by the scan function, in certain embodiments. See “System for Optimal Document Scanning,” U.S. patent application Ser. No. 12/351,302.
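Thresholding as described above can be illustrated with a minimal sketch. It assumes an 8-bit gray-scale image stored as a list of rows and a fixed global threshold of 128; both the representation and the threshold value are assumptions for illustration, and scanner software may choose thresholds differently.

```python
from typing import List

GrayImage = List[List[int]]


def binarize(gray: GrayImage, threshold: int = 128) -> GrayImage:
    """Convert each gray-scale pixel (0-255) to black (0) or white (255).

    Pixels at or above the threshold become white; all others become black.
    The fixed global threshold is an illustrative assumption.
    """
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in gray]


sample = [
    [0, 127, 128, 255],
    [200, 90, 30, 140],
]
binary = binarize(sample)
```

Every intermediate gray level collapses to one of two values, which shrinks the file but discards detail; this is the degradation that the background section attributes to post-scanning binarization.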

System 232 is a data transfer system. Under preferred embodiments, the data transfer system 232 manages the SSL connection and associated data transfer with the server system. The data transfer system 232 could be part of the scanning system 212, in certain embodiments. Moreover, the operation of the data transfer system 232 could be automatically triggered by the scan function, in certain embodiments.

System 235 is an optional image pre-processing system. The image pre-processing system 235 enhances the image quality of scanned images for a given resolution and other scanner settings. The image pre-processing system 235 may be implemented as part of the image capture system as depicted in FIG. 2 or as part of the server system as depicted in FIG. 3. When part of the image capture system, the image pre-processing system may also be implemented within the scanning system 212, in certain embodiments. The image pre-processing system 235 is described in further detail below as part of the document processing system 130.

FIG. 3 is a system diagram of the web server system 120 according to a preferred embodiment of the invention. System 120 has a web services system 310, an authentication system 312 and a content repository 322. The web services system 310 communicates with the authentication system 312 via software within a computer system. The web services system 310 communicates with the content repository 322 via software within a computer system.

System 310 is a web services system. Under preferred embodiments, the web services system 310 provides the production system connection to the network that interfaces with the image capture system. Such a network could be a local-area network (LAN), a wide-area network (WAN) or the Internet. As described above, the preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption. Standard web services include Apache, RedHat JBoss Web Server, Microsoft IIS, Sun Java System Web Server, IBM Websphere, etc. Under preferred embodiments, users upload their source electronic documents or download their organized electronic documents and extracted data in a secure manner using HTTP or HTTPS. Other mechanisms for secure data transfer can also be used. The web services system 310 also relays necessary parameters to the application servers that will process the electronic document.

System 312 is an authentication system. The authentication system 312 allows secure and authorized access to the content repository 322. Under preferred embodiments, an LDAP authentication system is used; however, other authentication systems can also be used. In general, an LDAP server is used to process queries and updates to an LDAP information directory. For example, a company could store all of the following very efficiently in an LDAP directory:

-   The company employee phone book and organizational chart
-   External customer contact information
-   Infrastructure services information, including NIS maps, email aliases, and so on
-   Configuration information for distributed software packages
-   Public certificates and security keys

Under a preferred embodiment, document organization and access rights are managed by the access control privileges stored in the LDAP repository.

System 322 is a content repository. The content repository 322 can be a simple file system, a relational database, an object oriented database, any other persistent storage system or technology, or a combination of one or more of these. Under a preferred embodiment, the content repository 322 is based on Java Specification Request 170 (JSR 170). JSR 170 is a standard implementation-independent way to access content bi-directionally on a granular level within a content repository. The content repository 322 is a generic application “data store” that can be used for storing both text and binary data (images, word processor documents, PDFs, etc.). One key feature of a content repository is that one does not have to worry about how the data is actually stored: data could be stored in a relational database (RDBMS) or a file system or as an XML document. In addition to providing services for storing and retrieving the data, most content repositories provide advanced services such as uniform access control, searching, versioning, observation, locking, and more.

Under preferred embodiments, documents in the content repository 322 are available to the end user via a portal. For example, in the current implementation of the system, the user can click on a web browser application button “View Source Document” in the portal and view the original scanned document over a secure network. Essentially, the content repository 322 serves as an off-site secure storage facility for users' electronic documents.

FIG. 4 is a system diagram of the document processing system 130 according to a preferred embodiment of the invention. System 130 has a service control manager 410, a job database 414, an image processing system 422, a classification system 432, a grouping system 442 and a data extraction system 452. The service control manager 410 communicates with the job database 414 via software within a computer system. The service control manager 410 communicates with the image processing system 422 via software within a computer system. The service control manager 410 communicates with the classification system 432 via software within a computer system. The service control manager 410 communicates with the grouping system 442 via software within a computer system. The service control manager 410 communicates with the data extraction system 452 via software within a computer system. The image processing system 422 communicates with the job database 414 via software within a computer system. The classification system 432 communicates with the job database 414 via software within a computer system. The grouping system 442 communicates with the job database 414 via software within a computer system. The data extraction system 452 communicates with the job database 414 via software within a computer system. The image processing system 422 communicates with the classification system 432 via software within a computer system. The classification system 432 communicates with the grouping system 442 via software within a computer system. The grouping system 442 communicates with the data extraction system 452 via software within a computer system. The document processing system 130 can be implemented as a set of communicating programs or as a single integrated program.

System 410 is a service control manager. Service control manager 410 is a system that controls the state machine for each job. The state machine identifies the different states and the steps that a job has to progress through in order to achieve its final objective, in this case being data extracted from an electronic document. In the current system, the service control manager 410 is designed to be highly scalable and distributed. Under preferred embodiments, the service control manager 410 is multi-threaded to handle hundreds or thousands of jobs at any given time. The service control manager 410 also implements message queues to communicate with other processes regarding their own states. Alternately, the service control manager 410 can be implemented in other architectures; for example, one can implement a complete database driven approach to step through all the different steps required to process such a job.

In preferred implementations the service control manager 410 subscribes to events for each new incoming job that needs to be processed. Once a new job arrives, the service control manager 410 pre-processes the job by taking the electronic document and separating each image (or page) into its own bitmap image for further processing. For example, if an electronic document had 30 pages, the system will create 30 images for processing. Each job in the system is given a unique identity. Furthermore, each page is given a unique page identity that is linked to the job identity. After the service control manager 410 has created image files by pre-processing the document into individual pages, it transitions the state of each page to image processing.
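The page-splitting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the service control manager 410: the `Page` record, the `split_job` helper, the identifier format and the state label are all hypothetical names chosen for the example.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Page:
    page_id: str   # unique page identity (the format is an assumption)
    job_id: str    # identity of the job the page belongs to
    number: int    # 1-based position within the source document
    state: str = "image_processing"  # hypothetical state label


def split_job(page_images, job_id=None):
    """Give the job a unique identity and each page a linked page identity."""
    job_id = job_id or uuid.uuid4().hex
    pages = [
        Page(page_id=f"{job_id}-{n:04d}", job_id=job_id, number=n)
        for n, _image in enumerate(page_images, start=1)
    ]
    return job_id, pages


# A 30-page document yields 30 page records, all linked to one job.
job_id, pages = split_job([b"bitmap"] * 30)
```

Deriving the page identity from the job identity is one simple way to keep the link between the two; a separate foreign-key column would serve equally well.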

System 414 is a job database. Job database 414 is used to store the images and data associated with each of the jobs being processed. A “job” is defined as a set of source documents and all intermediate and final processing outputs. Job database 414 can be file system storage, a relational database, an XML document or a combination of these. In preferred implementations, job database 414 uses file system storage to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.

System 422 is an image processing system. The image processing system 422 removes noise from the page image and properly orients the page so that document image analysis can be performed more accurately. The accuracy of the data extraction greatly depends on the quality of the image; thus image processing is included under preferred embodiments. The image processing system 422 performs connected component analysis and, utilizing a line detection system, creates “confetti” images which are small sections of the complete page image. Under preferred embodiments, the confetti images are accompanied by the coordinates of the image sub-section. The image processing system 422 is described in greater detail below.

System 432 is a classification system. The classification system 432 recognizes the page as one of a pre-identified set of types of documents. A major difficulty in categorizing a page as one of a large number of documents is the high dimensionality of the feature space. Conventional approaches that depend on text categorization alone are faced with a native feature space that consists of many unique terms (words as well as phrases) that occur in documents, which can be hundreds or thousands of terms for even a moderate-sized collection of unique documents. In one domain, multiple systems that categorize income tax documents such as W-2, 1099-INT, K-1 and other forms have experienced poor accuracy because of the thousands of variations of tax documents. The preferred implementation uses a combination of image pattern recognition and text analysis to distinguish documents and machine learning technology to scale to large numbers of documents. The classification system 432 is described in greater detail below.

System 442 is a grouping system. The grouping system 442 groups pages that have been categorized by the classification system 432 as specific instances of a pre-identified set of types of documents into sets of multi-page documents. The grouping system 442 is described in greater detail below.

System 452 is a data extraction system. The data extraction system 452 extracts data from pages that have been categorized by the classification system 432 as specific instances of a pre-identified set of types of documents. There are many difficulties in extracting data accurately from documents not specifically designed for automatic data extraction. Typically, the document images are not of uniformly high quality. The document images can be skewed, streaked, smudged, populated with artifacts and otherwise degraded in ways that cannot be fully compensated for by image processing. The document layout can appear to be random. The relevant content (data labels and data values) can be quite small, impaired by lines and background shading or otherwise not processed well by OCR. In the above-mentioned domain of tax document automation, vendors using conventional data extraction methods claim 50% or less data extraction and admit to numerous errors. The data extraction system 452 uses OCR data extraction, non-OCR visual recognition, contextual feature matching, business intelligence and output formatting, all with machine learning elements, to accurately extract and present data from a wide range of documents. The data extraction system 452 is described in greater detail below.

FIG. 5 is an image processing system 422 according to a preferred embodiment of the invention. System 422 has an image feature extraction system 510, a working image database 522, an image identification system 530, a trained image database 532 and an image training system 534. The image feature extraction system 510 is connected to the working image database 522 via software within a computer system. The image feature extraction system 510 is connected to the image identification system 530 via software within a computer system. The image identification system 530 is connected to the working image database 522 via software within a computer system. The image identification system 530 is connected to the trained image database 532 via software within a computer system. The image training system 534 is connected to the working image database 522 via software within a computer system. The image training system 534 is connected to the trained image database 532 via software within a computer system.

System 510 is an image feature extraction system. Image feature extraction system 510 extracts images from the submitted job artifacts. Image feature extraction system 510 normalizes images into a uniform consistent form for further image processing. Image feature extraction system 510 binarizes color and grayscale images. A document can be captured as a color, grayscale or binary image by a scanning device. Common problems seen in images from scanning devices include:

-   poor contrast due to lack of sufficient or controlled lighting
-   non-uniform image background intensity due to uneven illumination
-   immoderate amounts of random noise due to limited sensitivity of the sensors

Many document images are rich in color and have complex backgrounds. Accurately processing such documents typically requires time-consuming processing and manual tuning of various parameters. Detecting text in such documents is difficult for typical optical character recognition systems that are optimized for binary images on clean backgrounds. For the data extraction system to work well, document images must be binarized and the text must be readable. Typically, general purpose scanners binarize images using global thresholding utilizing a single threshold value, generally chosen based on statistics of the global image. Global thresholding is not well suited to images that suffer from common illumination or noise problems. Global thresholding often results in characters that are broken, merged or degraded; further, thousands of connected components can be created by binarization noise. Images degraded by global thresholding are typically candidates for low quality data extraction.

The preferred embodiment of the binarization system utilizes local thresholding where the threshold value varies based on the local content in the document image. The preferred implementation is built on an adaptive thresholding technique which exploits local image contrast (reference: IEICE Electronics Express, Vol. 1, No 16, pp. 501-506). The adaptive nature of this technique is based on flexible weights that are computed based on local mean and standard deviations calculated for the gray values in the primary local zone or window. The preferred embodiment experimentally determines optimum median filters across a large set of document images for each application space. Reference “Systems and Methods for Handling and Distinguishing Binarized Background Artifacts in the Vicinity of Document Text and Image Features indicative of a Document Category” US 2009/0119296 A1.
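To illustrate how a threshold can follow the local mean and standard deviation, the sketch below uses the well-known Sauvola formula, T = m · (1 + k · (s / R − 1)), as a stand-in. It is not the cited IEICE technique; the window size and the parameters `k` and `r` are assumed values chosen for the example.

```python
from statistics import mean, pstdev


def local_threshold_binarize(gray, window=3, k=0.2, r=128.0):
    """Binarize a grayscale image (rows of 0..255 ints); 1 marks dark ink.

    Each pixel is compared with a threshold computed from the mean and
    standard deviation of the gray values in its surrounding window.
    """
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            values = [
                gray[j][i]
                for j in range(max(0, y - half), min(h, y + half + 1))
                for i in range(max(0, x - half), min(w, x + half + 1))
            ]
            m, s = mean(values), pstdev(values)
            threshold = m * (1 + k * (s / r - 1))  # Sauvola-style threshold
            row.append(1 if gray[y][x] < threshold else 0)
        out.append(row)
    return out


# The same mark is found on a brightly lit and on a dimly lit background.
bright = [[200, 200, 200], [200, 50, 200], [200, 200, 200]]
dim = [[100, 100, 100], [100, 20, 100], [100, 100, 100]]
```

A single global threshold of, say, 128 would turn the entire dim patch black, whereas the local threshold isolates the mark in both patches.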

The preferred embodiment of image feature extraction system 510 removes noise in the form of dots, specks and blobs from document images. In the preferred embodiment, minimum and maximum dot sizes to be removed are specified. The preferred embodiment also performs image reversal so that white text or line objects on black backgrounds are detected and inverted to black-on-white. The preferred embodiment also performs two noise removal techniques.

The first technique starts with any small region of a binary image. The preferred implementation takes a 35×35 pixel region. In this region all background pixels are assigned value “0.” Pixels adjacent to background are given value “1.” A matrix is developed in this manner. In effect each pixel is given a value called the “distance transform” equal to its distance from the closest background pixel. The preferred implementation runs a smoothing technique on this distance transform. Smoothing is a process by which data points are averaged with their neighbors in a series; this typically has the effect of blurring the sharp edges in the smoothed data. Smoothing is sometimes referred to as filtering, because smoothing has the effect of suppressing high frequency signals and enhancing low frequency signals. Of the many different methods of smoothing, the preferred implementation uses a Gaussian kernel. In particular, the preferred implementation performs Gaussian smoothing with a filter using variance of 0.5 and a 3×3 kernel or convolution mask on the distance transform. Thresholding with a thresholding value of 0.85 is performed on the convolved images and the resulting data is converted to its binary space.
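The first technique can be sketched in a few lines. The variance of 0.5, the 3×3 kernel and the threshold of 0.85 come from the description above; the city-block distance metric, the zero padding at the region border and the function names are assumptions made for this illustration.

```python
from collections import deque
from math import exp


def distance_transform(img):
    """City-block distance of every pixel to the nearest background (0) pixel."""
    h, w = len(img), len(img[0])
    dist = [[0 if v == 0 else None for v in row] for row in img]
    queue = deque((y, x) for y in range(h) for x in range(w) if img[y][x] == 0)
    while queue:  # breadth-first search outward from all background pixels
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    # A region with no background at all gets an upper-bound distance.
    return [[h + w if d is None else d for d in row] for row in dist]


def gaussian_kernel(variance=0.5):
    """Normalized 3x3 Gaussian convolution mask."""
    raw = [[exp(-(dx * dx + dy * dy) / (2 * variance)) for dx in (-1, 0, 1)]
           for dy in (-1, 0, 1)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def denoise(img, variance=0.5, threshold=0.85):
    """Smooth the distance transform and threshold it back to binary."""
    h, w = len(img), len(img[0])
    dist, kernel = distance_transform(img), gaussian_kernel(variance)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:  # zero padding outside
                        acc += kernel[dy + 1][dx + 1] * dist[ny][nx]
            out[y][x] = 1 if acc >= threshold else 0
    return out
```

With these parameters an isolated single pixel smooths to about 0.33 and is removed, while the interior of a solid 3×3 block smooths to well above 0.85 and survives; the block's corner pixels are eroded.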

The second technique uses connected component analysis to identify small or bad blocks. In this method a sliding mask is created of a known size. The preferred implementation uses a mask that is 35×35 pixels in size. This mask slides over the entire image and is used to detect the number of blobs (connected components) that are less than 10 pixels in size. If the number of blobs is greater than five, then all blobs are removed. This process is repeated by sliding the mask over the entire image.
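A minimal sketch of the second technique follows. The 35×35 mask, the 10-pixel blob size and the count of five come from the description above. Several details are assumptions made for the example: the mask advances in non-overlapping steps, components are 8-connected, only the small blobs lying wholly inside a mask position are removed, and the function names are hypothetical.

```python
def connected_components(img):
    """Return the 8-connected foreground components as lists of (y, x)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                stack, comp = [(y, x)], []
                while stack:
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                components.append(comp)
    return components


def remove_speckle(img, mask=35, max_blob=10, max_count=5):
    """Erase small blobs wherever a mask position holds more than max_count."""
    h, w = len(img), len(img[0])
    small = [c for c in connected_components(img) if len(c) < max_blob]
    out = [row[:] for row in img]
    for top in range(0, h, mask):          # assumed: non-overlapping steps
        for left in range(0, w, mask):
            inside = [c for c in small
                      if all(top <= y < top + mask and left <= x < left + mask
                             for y, x in c)]
            if len(inside) > max_count:
                for comp in inside:
                    for y, x in comp:
                        out[y][x] = 0
    return out
```

A smaller step than the mask size would also catch small blobs that straddle two mask positions, at the cost of more window evaluations.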

Image feature extraction system 510 corrects skew, i.e., small angular rotations, in document images. Skew correction not only improves the visual appearance of the document but also improves baseline determination, simplifies interpretation of page layout and improves text recognition. Several available image processing libraries do skew correction. The preferred implementation of skew detection uses part of the open source Leptonica image processing library.

Image feature extraction system 510 corrects document orientation. Documents, originally in either portrait or landscape format, may be rotated by 0, 90, 180 or 270 degrees during scanning. The preferred implementation of orientation correction performs OCR on small words or phrase images at all four orientations: 0, 90, 180 and 270 degrees. Small samples are selected from a document and the confidence is averaged across the sample. The orientation that has the highest confidence determines the correct orientation of the document.
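The selection rule can be sketched as below. The OCR step itself is outside the example: the confidence values are made-up inputs, and `pick_orientation` is a hypothetical helper name.

```python
from statistics import mean


def pick_orientation(confidences):
    """Return the angle whose sampled OCR confidences have the highest mean.

    `confidences` maps an angle in degrees to the OCR confidence values
    obtained for the sample images when read at that angle.
    """
    return max(confidences, key=lambda angle: mean(confidences[angle]))


# Hypothetical confidences for three samples read at each orientation.
samples = {
    0: [0.21, 0.35, 0.18],
    90: [0.12, 0.20, 0.15],
    180: [0.91, 0.84, 0.88],
    270: [0.10, 0.14, 0.09],
}
best = pick_orientation(samples)
```

Averaging over several samples keeps one lucky misread, such as a symmetric word that reads equally well upside down, from deciding the orientation.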

Image feature extraction system 510 performs connected component analysis using a standard technique. The preferred implementation of connected component analysis uses the open source Image Processing Library 98 (IPL98).

Image feature extraction system 510 detects text lines using the technique described by Okun et al. (reference: Robust Text Detection from Binarized Document Images) to identify candidate text segments, i.e., blocks of consistent height. For a page from a book, this method may identify a whole line as a block, while for a form with many boxes this method will identify the text in each box.

Image feature extraction system 510 generates confetti information by storing the coordinates of all of the text blocks in the working image database 522.

Image feature extraction system 510 performs image processing on the confetti images. Traditionally, if image processing is performed on document images, the entire document image is subject to a single type of image processing. This “single algorithm” process might, for example, thin the characters on the document image. In some cases, the accuracy of text extraction with OCR might improve after thinning; however, in other cases on the same document, the accuracy of text extraction with OCR might improve with thickening. Image feature extraction system 510 applies multiple morphological operators to individual confetti images. Then, for each variation of each confetti image (including the original, unprocessed versions and all processed versions) image feature extraction system 510 extracts text with OCR. Optionally, image feature extraction system 510 extracts text with different OCR engines. Several OCR software programs are available on the market today. The preferred implementation uses Tesseract, open source software that allows custom modifications. The extracted text output (text, OCR engine used and corresponding confidence value) is saved for each version of each confetti image. An illustration of source document images before and after image processing is shown in FIG. 11.

Image feature extraction system 510 determines the contour of image areas within confetti boxes. The contour of an image within a confetti is illustrated in FIG. 14. The size of the confetti image area is first normalized. In preferred implementations, 256 equidistant points on the contour are chosen, and the relative location of these points is recorded in a log-polar histogram as illustrated in FIG. 12. Values for log r are placed in 3 bins, while values for the angle are placed in 8 bins. The relative location of a point with respect to another is therefore a number from 1 through 24.

The feature vector for the shape of the contour as illustrated in FIG. 14 is a 256×256 matrix of numbers from 1 through 24 that considers all the 256 points and their relative locations (reference: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No 24, pp. 509-422).
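The binning of relative locations into 3 × 8 = 24 cells can be sketched as follows. The radial bin edges, the numbering order of the cells and the treatment of a point relative to itself are assumptions made for the example; the source specifies only the bin counts. Coordinates are assumed to be normalized already.

```python
from math import atan2, hypot, pi

R_EDGES = (0.125, 0.5)  # assumed log-spaced edges, giving 3 radial bins
ANGLE_BINS = 8


def relative_bin(p, q):
    """Number from 1 through 24 for the location of q relative to p."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    r = hypot(dx, dy)
    radial = sum(r >= edge for edge in R_EDGES)   # 0, 1 or 2
    theta = atan2(dy, dx) % (2 * pi)              # angle in [0, 2*pi)
    angular = min(ANGLE_BINS - 1, int(theta / (2 * pi / ANGLE_BINS)))
    return radial * ANGLE_BINS + angular + 1


def shape_matrix(points):
    """N x N matrix of relative-location bins (256 x 256 for 256 points).

    A point relative to itself falls into bin 1 here; that convention is
    an assumption of this sketch.
    """
    return [[relative_bin(p, q) for q in points] for p in points]


contour = [(0.0, 0.0), (1.0, 0.0), (0.1, 0.2), (-0.05, -0.02)]
matrix = shape_matrix(contour)
```

Spacing the radial edges by a constant factor is equivalent to binning log r uniformly, which makes the descriptor more sensitive to nearby points than to distant ones.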

System 522 is a working image database. Working image database 522 is used to support both the processing of jobs and the image training system 534. Working image database 522 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the working image database 522 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.

System 530 is an image identification system. The image identification system 530 looks for point and line features. The preferred implementation performs image layout analysis using two image properties: the points of intersection of lines and the edge points of text paragraphs. Every unique representation of points is referred to as a unique class in the system and represents a unique point pattern in the system database. The preferred implementation uses a heuristically developed convolution method only on black pixels to perform a faster computation. The system identifies nine types of points: four T's, four L's, and one cross (X) using nine masks; examples of these nine point patterns are shown in FIG. 10.

The preferred implementation of point pattern matching is performed by creating a string from the points detected in the image and then using the Levenshtein distance to measure the gap between the trained set and the input image. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
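The Levenshtein distance defined above can be computed with the standard dynamic-programming recurrence, sketched below. The encoding of point types as the characters T, L and X and the trained-pattern names are hypothetical and used only for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def closest_pattern(points, trained):
    """Name of the trained point-pattern string nearest to `points`."""
    return min(trained, key=lambda name: levenshtein(points, trained[name]))


# Hypothetical point strings: one character per detected point type.
trained = {"form_a": "TLXLL", "form_b": "XXTT"}
match = closest_pattern("TLXLT", trained)
```

Because the distance tolerates insertions and deletions, a page on which one junction was missed or spuriously detected can still land on the right trained pattern.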

The image identification system 530 selects the extracted text from the sets of extracted text for each confetti image according to rules stored in the trained image database 532. In preferred implementations of the image identification system 530, extracted text values that exceed specified OCR engine-specific thresholds are candidates for selection. The best text value that is produced from the image after applying the morphological operators is chosen based on OCR confidence, similarity and presence in a dictionary.

In preferred implementations, based on the results of the “first pass” OCR performed by the image feature extraction system 510, the image identification system 530 selects the text value from a contextually limited lexicon (words and characters) that is stored in the trained image database 532. In preferred implementations, the image identification system 530 requests the image feature extraction system 510 to perform a “second pass” OCR operation using an engine specifically tailored for extracting the type of characters that the image identification system 530 identified as present in the confetti image. As an example, if the image identification system 530 identified the confetti image as containing characters associated only with currency values (such as the digits 0-9, dollar sign, period, comma, minus sign, parentheses and asterisk) then the “second pass” OCR would be conducted with a currency character recognition system that is tuned to identify numerical and certain special characters. The currency character recognition system utilizes OCR technology tailored to the reduced character set associated with currency values. In the preferred implementation, the currency character set is defined as the digits [0-9] and the special character set [$.,-()*]. The preferred implementation performs character segmentation to break up the image into individual characters. It then uses a normalized bitmap of the image of each character as a feature vector. This feature vector is passed into a neural network based classifier that was trained on more than 10,000 instances of each character that are stored in the trained image database 532.
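The decision to route a confetti image to the currency recognizer can be sketched as a character-set test on the first-pass text. The character set is the one given above; ignoring whitespace, the helper names and the engine labels "currency" and "general" are assumptions made for the example.

```python
CURRENCY_CHARS = set("0123456789$.,-()*")


def is_currency_text(text):
    """True if the text, ignoring whitespace, uses only currency characters."""
    stripped = "".join(text.split())
    return bool(stripped) and set(stripped) <= CURRENCY_CHARS


def choose_second_pass(first_pass_text):
    """Pick a (hypothetical) engine label for the second OCR pass."""
    return "currency" if is_currency_text(first_pass_text) else "general"
```

For example, `choose_second_pass("$1,234.56")` selects the currency engine, while a label such as "Total" stays with the general engine.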

Label identification by traditional means of matching extracted text to a database of expected values is often not possible; this is caused by the inability of OCR engines to accurately extract text from very small and degraded images. The present invention's use of both multiple versions of the confetti images (original and image processed) and multiple OCR engines significantly reduces but does not eliminate the problem of inaccurate text extraction. Two additional techniques are used to identify text from images.

The image identification system 530 performs contour matching by comparing the contour shape features extracted by the image feature extraction system 510 with the corresponding features of known confetti images stored in the trained image database 532. Similarity between images is determined by a point-wise comparison of feature vectors. The preferred implementation uses a KNN classifier for this process. FIG. 14 illustrates label contour matching.
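A k-nearest-neighbor vote over point-wise comparisons can be sketched as below. The mismatch fraction used as the distance, the value k = 3 and the toy feature vectors and labels are assumptions made for the example; the source states only that a KNN classifier and a point-wise comparison are used.

```python
from collections import Counter


def mismatch(u, v):
    """Point-wise comparison: fraction of positions where two vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def knn_label(query, trained, k=3):
    """Majority label among the k trained vectors closest to the query."""
    nearest = sorted(trained, key=lambda item: mismatch(query, item[0]))[:k]
    votes = Counter(label for _vector, label in nearest)
    return votes.most_common(1)[0][0]


# Toy feature vectors of bin numbers paired with hypothetical labels.
trained = [
    ([1, 2, 3, 4], "Wages"),
    ([1, 2, 3, 5], "Wages"),
    ([1, 2, 9, 9], "Wages"),
    ([7, 7, 7, 7], "Tax"),
    ([7, 7, 7, 8], "Tax"),
]
```

In practice the vectors would be the flattened contour matrices, and k would be tuned on held-out trained images.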

System 532 is a trained image database. Trained image database 532 is used to support both the processing of jobs and the image training system 534. Trained image database 532 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained image database 532 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more images and more rules pertaining to restricting OCR with contextual information, the trained image database 532 grows. As the machine learning system sees more trained images, its image identification accuracy increases.

System 534 is an image training system. The image training system 534 performs computations on the data in its document database corresponding to the images that are in place and generates datasets used by the image identification system for recognizing the content in source document images. The results of the training and re-training process are image datasets that are updated in the trained image database 532.

The image training system 534 implements a continuous learning process in which images and text that are not properly identified are sent to training. The training process results in an expanded data set in the trained image database 532, thereby improving the accuracy of the system over time. As the trained image database 532 grows, the system requires an asymptotically lower percentage of images to be trained. Preferred implementations use machine learning supported by the image training system 534 that adapts to a growing set of document images. Additional documents add additional image features that must be analyzed.

The learning system receives documents from the working image database 522 that were provided by the image identification system 530. These documents are not trained and do not have corresponding model data in the trained image database 532. All such documents are made persistent in the trained image database 532.

Preferred implementations of the training system include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.

The learning technique in the preferred implementation is supervised learning. Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems. Example input vectors include key words and line patterns of the document layouts. Example target vectors include possible classes of output in the organized document. Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.

FIG. 6 is a classification system 432 according to a preferred embodiment of the invention. System 432 has class feature extraction systems 610, working class databases 622, class identification systems 630, trained class databases 632, class training systems 634, a voting system 640, a trained voting decision tree 642 and a voting training system 644. The class feature extraction system (i) 610 is connected to the working class database (i) 622 via software within a computer system. The class feature extraction system (i) 610 is connected to the class identification system (i) 630 via software within a computer system. The class identification system (i) 630 is connected to the working class database (i) 622 via software within a computer system. The class identification system (i) 630 is connected to the trained class database (i) 632 via software within a computer system. The class training system (i) 634 is connected to the working class database (i) 622 via software within a computer system. The class training system (i) 634 is connected to the trained class database (i) 632 via software within a computer system. The class identification system (i) 630 is connected to the voting system 640 via software within a computer system. The voting system 640 is connected to the trained voting decision tree 642 via software within a computer system. The trained voting decision tree 642 is connected to the voting training system 644 via software within a computer system.

Under the preferred embodiment, classification system 432 is composed of four classification subsystems whose outputs are evaluated by the voting system 640. The four classification subsystems are:

-   Combined text and image (CTI) classification subsystem
-   CRK classification subsystem
-   SVM classification subsystem
-   CCS classification subsystem

Each of the above subsystems has a class feature extraction system 610, a working class database 622, a class identification system 630, a trained class database 632 and a class training system 634.

Each system 610 is a class feature extraction system. Class feature extraction systems 610 receive extracted text and image features (discussed above). The CTI classification subsystem and the CRK classification subsystem use the extracted text features.

The SVM classification subsystem addresses the problem of classifying documents as OCR results improve; as document quality, scanning practices, image processing or OCR engines improve, the extracted source document text differs from the extracted text of training documents, causing classification to worsen. The SVM class feature extraction system 610 filters extracted text features, passing on only those text features that match a dictionary entry.

In the preferred implementation, the SVM class feature extraction system 610 matches OCR text output of a text document against a large dictionary. If no dictionary match is found, the OCR text is discarded. A feature vector that consists of all OCR text that matches the dictionary is passed to an SVM-based classifier to determine the document class.

The SVM classification subsystem is made resilient to OCR errors by introducing typical OCR errors into the dictionary. However, the classifier remains robust to OCR improvements because the dictionary includes correct English words.
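The dictionary filter that precedes the SVM can be sketched as follows. The tokenization rule, the tiny dictionary and the sample OCR string are assumptions made for the example, and the SVM classifier itself is not shown.

```python
import re


def dictionary_features(ocr_text, dictionary):
    """Keep only the OCR tokens that match a dictionary entry."""
    tokens = re.findall(r"[a-z0-9]+", ocr_text.lower())
    return [token for token in tokens if token in dictionary]


# Hypothetical dictionary: correct words plus one typical OCR confusion
# ("rn" read in place of "m").
dictionary = {"wages", "tips", "compensation", "cornpensation"}
features = dictionary_features("Wages, tlps, other cornpensation xq7", dictionary)
```

Here "wages" and the known misreading "cornpensation" pass through, while "tlps", "other" and "xq7" are discarded because they match no entry.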

The CCS classification subsystem addresses the problem of classifying documents with poor image quality that do not OCR well; such documents have poor text extraction and therefore poor text-based classification. The CCS classification subsystem uses robust image features exclusively to classify documents.

In the preferred implementation, the CCS class feature extraction system610 first creates a code book using seven randomly selected documents.Each of these documents is divided into 10×10 pixel blocks. The K-meansalgorithm is applied to each block to generate 150 clusters. The mean ofthese clusters is taken as the representative codeword for that cluster.The clusters are arbitrarily numbered from 1 to 150; the result forms avocabulary for representing source document images as a feature vectorof this vocabulary.

Each source document image is divided into four quadrants. A vector isformed for each quadrant following the term frequency inverse documentfrequency (TF-IDF) model. At the classification step, a K-means approachis used. A test document is encoded to the feature vector form, and itsEuclidean distance is computed from each of the clusters. The labels ofthe closest clusters are assigned to the document.

Each system 622 is a working class database. Working class databases 622 are used to support both the processing of jobs and the class training systems 634. Working class databases 622 can be file systems, relational databases, XML documents or a combination of these. In preferred implementations, the working class databases 622 use file systems to store large blobs (binary large objects) and relational databases to store pointers to the blobs and other information pertinent to processing the job.

System 630 is a class identification system. Class identification system 630 functions differently for each of the four classification subsystems.

In the first case, the CTI classification subsystem, the class identification system 630 presents the extracted text to a key word identification system. The key word identification system receives the confetti text and interfaces with the trained class database 632. The trained class database 632 consists of a global dictionary, global priority words and the point pattern signatures of all the trained forms, all of which are created by the class training system 634.

Under the preferred embodiment, stop words are removed from the list of extracted words. Stop words are common words (for example, “a,” “the,” “it,” and “not”) and, in the case of income tax documents, phrases and words such as “Internal Revenue Service,” “OMB,” “name,” and “address.” The stop words are provided by the trained class database 632 and, in the preferred embodiment, are domain specific.

In the preferred implementation, the priority of each word is calculated as a function of the line height (LnHt) of the word, partial or full match (PFM) with the form name and the total number of words in the form (N). The approximate value of the priority is formulated as

Pr=(ΣLnHt*PFM)/N

The summation is taken to give more priority to a word whose frequency is higher in a particular form. Partial or full match (PFM) increases the priority if the word partially or fully matches the form name. The calculation divides by the total number of words in the form (N) to normalize the frequency if the form has a large number of words.

The vector space creation system stores in a table the priority of each word in the form. A vector is described as (a1, a2, . . . ak) where a1, a2 . . . ak are the magnitudes in the respective dimensions. For example, for input words and corresponding line heights of a W-2 tax form, the following word-priority vectors are stored:

OMB 10; employer 5; employer 5; wages 5; compensation 5; compensation 5; dependent 5; wages 10; social 5; security 5; income 5; tax 5; federal 5; name 5; address 5

The normalized values of the priorities are:

OMB 0.666667; employer 0.666667; wages 1.000000; compensation 0.666667; dependent 0.333333; social 0.333333; security 0.333333; income 0.333333; tax 0.333333; federal 0.333333; name 0.333333; address 0.333333

In such a vector space, the words with larger font size or higher frequency will have higher priority.
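The priority calculation for the W-2 example may be sketched as follows. Two assumptions are made that the text does not state: PFM is taken as 1 for every word, and normalization divides each priority by the largest priority. Under these assumptions the sketch yields the normalized values listed above. The function names are illustrative.

```python
from collections import defaultdict

# (word, line height) pairs taken from the W-2 example above.
OBSERVATIONS = [
    ("OMB", 10), ("employer", 5), ("employer", 5), ("wages", 5),
    ("compensation", 5), ("compensation", 5), ("dependent", 5),
    ("wages", 10), ("social", 5), ("security", 5), ("income", 5),
    ("tax", 5), ("federal", 5), ("name", 5), ("address", 5),
]


def word_priorities(observations, pfm=lambda word: 1.0):
    """Pr = (sum of LnHt * PFM) / N for each distinct word.

    pfm is a hypothetical callback giving the partial/full match factor;
    it defaults to 1.0 (assumption: no form-name matching is applied).
    """
    n = len(observations)
    totals = defaultdict(float)
    for word, line_height in observations:
        totals[word] += line_height * pfm(word)
    return {word: total / n for word, total in totals.items()}


def normalize(priorities):
    """Scale so the highest-priority word has value 1.0 (assumed scheme)."""
    top = max(priorities.values())
    return {word: value / top for word, value in priorities.items()}


normalized = normalize(word_priorities(OBSERVATIONS))
for word, value in normalized.items():
    print(f"{word} {value:.6f}")
```

The word "wages" occurs twice with line heights 5 and 10, so its summed line height (15) is the largest and it receives the normalized priority 1.000000.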

The ranking system calculates the cosine distance of two vectors V1 and V2 as:

cos θ=(V1·V2)/(|V1|*|V2|)

where V1·V2 is the dot product of the two vectors and |V| represents the magnitude of a vector. When the cosine distance nears 0, the vectors are orthogonal, and when it nears 1, the vectors point in the same direction, that is, they are similar.

The class which has the maximum cosine distance with the form is the class to which the form is classified.
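The ranking step may be sketched as follows, with word-priority vectors held as sparse word-to-priority mappings. The class names, the trained vectors and the function names are hypothetical. The quantity computed is the cosine of the angle between the vectors as defined above (often called cosine similarity elsewhere).

```python
import math


def cosine(v1, v2):
    """cos(theta) = (V1 . V2) / (|V1| * |V2|) for sparse word->priority dicts."""
    dot = sum(v1[word] * v2[word] for word in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm1 * norm2)


def best_class(form_vector, class_vectors):
    """Return the class whose trained vector has the largest cosine value."""
    return max(class_vectors, key=lambda c: cosine(form_vector, class_vectors[c]))


# Hypothetical trained vectors for two form classes.
CLASS_VECTORS = {
    "W-2": {"wages": 1.0, "employer": 0.67, "compensation": 0.67},
    "1099-INT": {"interest": 1.0, "payer": 0.67, "income": 0.33},
}
form = {"wages": 1.0, "employer": 0.5, "income": 0.2}
print(best_class(form, CLASS_VECTORS))
```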

The class identification system 630 performs point pattern matching based on the image features collected during image processing. As mentioned earlier, the point pattern matching of documents is performed by creating a string from the points detected in the image and then using Levenshtein distance to measure the gap between the trained set and the input image.
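The comparison of point pattern strings may be sketched as follows. The Levenshtein distance is the standard dynamic-programming edit distance; the encoding of detected points as one character per point and the trained signatures shown are assumptions made only for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical point-pattern signatures: one character per detected point.
TRAINED = {"W-2": "TLLXTLLX", "1040": "TXXLLTTL"}


def closest_signature(signature, trained=TRAINED):
    """Return the trained form whose signature has the smallest distance."""
    return min(trained, key=lambda name: levenshtein(signature, trained[name]))


print(closest_signature("TLLXTLX"))
```

A smaller distance indicates a smaller gap between the input image and a trained form; a threshold on the distance could be used to reject poor matches.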

In the preferred embodiment of the CTI classification subsystem, the results of the ranking and the point pattern matching are used to determine the class matching values. If the system is not successful in finding a class match within a defined threshold, the document is marked as unclassified.

In the second case, the CRK classification subsystem, the class identification system 630 first identifies a source document as a member of a particular group of classes and then identifies the source document as a member of a particular individual class. The CRK class identification system 630 performs hierarchical classification with a binary classifier system using regularized least squares and a multi-class classifier using K-nearest neighbor. A flow diagram of an example CRK class identification system 630 used in classifying income tax documents is shown in FIG. 15.

In the third case, the SVM classification subsystem, the class identification system 630 identifies a source document using a support vector machine operating on a set of trained data. If the lookup fails, the source document is marked as unclassified.

In the fourth case, the CCS classification subsystem, the class identification system 630 works much like the CTI class identification system 630. The CCS class identification system 630 compares the code vectors for each quadrant of source documents with code vectors in the trained class database 632 using the K-means approach. The trained class database 632 is organized into clusters representing documents in the training set with similar image properties as defined by the feature vectors. The mean point of each cluster within the feature vector space is used to represent each cluster. In addition, each cluster is tagged with all document classes that occurred within the cluster. The distance of the feature vector of a source document from the mean of each cluster is computed, and the K nearest clusters are considered. The document class tags of these clusters are chosen as plausible classes of the source document.

The CCS trained class database 632 stores code vectors of all the trained forms, all of which are created by the CCS class training system 634.

System 632 is a trained class database. Trained class database 632 is used to support both the processing of jobs and the class training system 634. Trained class database 632 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained class database 632 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more documents, the trained class database 632 grows. As the machine learning system sees more classification data, its classification accuracy increases.

System 634 is a class training system. The class training system 634 adapts to a growing set of documents; additional documents add additional features that must be analyzed. Preferred implementations of the class training system 634 include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.

The learning technique that is used to bootstrap the system in the preferred implementation is supervised learning. Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems. Example input vectors include key words and line patterns of the document layouts. Example target vectors include possible classes of output in the organized document. Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.

To maintain the system, semi-supervised learning is utilized. In the preferred implementation, data that is flowing through the system is analyzed and those data that the system failed to correctly identify are isolated. These data are passed through a retraining phase, and the training data in the system are updated after appropriate regression testing.

In high volume, a fully automated process is utilized. Here, the data that are needed for retraining are automatically identified and fed to the retraining phase. The new training data are automatically injected into a regression test system to ensure correctness. If the regression test passes, the production system is automatically updated with the new training data.

The learning system receives documents from the trained class database 632. These documents are not trained and do not have corresponding classification model data in the class database. All such documents are made persistent in the trained class database 632.

The trained class database 632 has several tables which contain the document class information as well as image processing information (which is discussed in greater detail below). The following tables are part of the training database:

Form class (classification view)

Page table (details of the page of the electronic document)

Manual classification table (manual work information)

Manual training table (trainers' information)

Confetti table (confetti information, original text, corrected text, etc.)

Class training system 634 utilizes a training process management system that manages the distribution of the training task. Under preferred embodiments, a user, called a “trainer,” logs into the system in which the trainer has privileges at one of three trainer levels:

-   Top tier: add new classes to the system and perform classification and training
-   Middle tier: perform manual classification and training
-   Bottom tier: only perform training (manual text correction).

The training process manager directs document processing based on the document state:

-   Unclassified page is scheduled for manual classification
-   Manual classification is done as per policy and form class is assigned
-   Job database is updated with form class information and page/job states are changed so that the page can go to next state
-   If the form class state is not trained, the form is scheduled for training, else no action is needed

After form training, the form class state is changed to trained, not synched if allowed by policy. The document class has the following states:

-   Untrained
-   Partially trained
-   Trained, need synch with classification database
-   Trained, synched with classification database

Each document that requires training is manually identified and the extracted text is corrected as needed. The trainer follows two independent steps:

-   Manually classifying the form and assigning a class and subclass
-   Manually correcting text extracted by OCR (name required training for now)

Manual identification and text correction comprises a number of steps:

-   Receive pages from the training manager, which manages the flow of pages between various trainers and implements training policy and restrictions
-   Manual classification user interface (UI), which presents the page and asks the user to classify it
-   Manual text correction UI, which presents the page with marked up confetti; the user views the confetti and corrects the text extracted from the confetti
-   Training viewer UI, which is used to view the training database; the preferred implementation includes reports and representations of the training database
-   Classification verification UI, which presents a page and its classification to a trainer

All user interfaces are integrated into a single system.

The class training system 634 combines the document image, the manually classified information and the corresponding text.

New trained data that passes regression testing is inserted by the class training system 634 into the trained class database 632.

In the case of the CRK class training system 634, chi-square feature selection attempts to select the most relevant keywords (bag-of-words) for each class:

$\chi^{2} = \frac{N\left( {AD} - {BC} \right)^{2}}{\left( {A + C} \right)\left( {B + D} \right)\left( {A + B} \right)\left( {C + D} \right)}$

Where

A=number of times word t co-occurs with class c

B=number of times word t occurs without class c

C=number of times class c occurs without word t

D=number of times neither word t nor class c occur

N=total number of words

This approach ranks the relevance of each word for a particular class so that a sufficient number of features are obtained.
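The chi-square ranking may be sketched as follows, using the quantities A, B, C and D defined above. The sketch assumes that N equals A + B + C + D, and the counts shown are hypothetical; the function names are illustrative.

```python
def chi_square(a, b, c, d):
    """Chi-square score of word t for class c from a 2x2 contingency table.

    a: t co-occurs with c        b: t occurs without c
    c: c occurs without t        d: neither t nor c occurs
    """
    n = a + b + c + d  # assumption: N is the sum of the four counts
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the word carries no information
    return n * (a * d - b * c) ** 2 / denominator


def rank_words(counts, top_k):
    """Return the top_k words with the highest chi-square score."""
    ordered = sorted(counts, key=lambda w: chi_square(*counts[w]), reverse=True)
    return ordered[:top_k]


# Hypothetical (A, B, C, D) counts for three candidate keywords.
COUNTS = {
    "wages": (40, 10, 10, 40),
    "the": (25, 25, 25, 25),
    "employer": (30, 20, 20, 30),
}
print(rank_words(COUNTS, top_k=2))
```

A word that is distributed independently of the class (such as "the" in this example) scores 0 and is ranked last.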

Term frequency-inverse document frequency (TF-IDF) is used to represent each document:

$tf_{i} = \frac{n_{i}}{\sum_{k} n_{k}},\quad idf_{i} = \log \frac{|D|}{\left| \left\{ d : t_{i} \in d \right\} \right|}$

Where

$n_{i}$ = number of occurrences of feature keyword $i$ in the document

$\sum_{k} n_{k}$ = number of occurrences of all terms in the document

$|D|$ = total number of documents in the data

$\left| \left\{ d : t_{i} \in d \right\} \right|$ = number of documents in the data that contain feature keyword $t_{i}$

Each vector is normalized into unit Euclidean norm.
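The TF-IDF representation with unit Euclidean normalization may be sketched as follows. The weight of each term is taken as the product of its term frequency and inverse document frequency, which is the usual TF-IDF convention and is assumed here; the tokenized documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one unit-norm TF-IDF mapping per tokenized document."""
    doc_count = len(documents)
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        vec = {
            term: (n / total) * math.log(doc_count / doc_freq[term])
            for term, n in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {term: v / norm for term, v in vec.items()}
        vectors.append(vec)
    return vectors


DOCS = [["wages", "tax", "wages"], ["interest", "tax"], ["dividend", "tax"]]
vectors = tfidf_vectors(DOCS)
print(vectors[0])
```

The term "tax" appears in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing to any vector.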

In the tax document classification example shown in FIG. 15, using these features, four regularized least square classifiers are trained for organizer, brokerage, IRS and misc categories at level 1. Finally, a KNN classifier is used to refine the IRS classes. The cosine distance is used as a similarity measure.

System 640 is a voting system. The voting system 640 uses the output of each of the classifier subsystems 630 to choose the best classification result for an image, based on empirical observations of each classifier subsystem behavior on a large training dataset. These empirical observations are encoded into a trained voting decision tree 642. The voting system 640 uses the trained voting decision tree 642 to choose the final classification of an image. The trained decision tree 642 is built using the voting training system 644.

System 642 is a trained voting decision tree. The trained voting decision tree 642 is used to support the voting system 640. Trained voting decision tree 642 can be encoded as part of a program, file, relational database, XML document or a combination of these. In preferred implementations, the trained voting decision tree 642 is encoded as a program within a decision making process. As the system grows “smarter” by recognizing more images, the trained voting decision tree 642 evolves, resulting in a system with increasing image identification accuracy.

System 644 is a voting training system. The voting training system 644 considers the real classifications of a training dataset and the respective outputs of each of the classifier subsystems 630. Using this data, the voting training system 644 builds a decision tree, giving appropriate weights and preference to the correct results of each of the classification subsystems 630. This approach results in maximized correctness of final classification, especially when each classification subsystem 630 is adept at classifying different, not necessarily disjoint, subsets of documents.

FIG. 7 is a grouping system 442 according to a preferred embodiment of the invention. System 442 has a group feature extraction system 710, a working group database 722, a group identification system 730, a trained group database 732 and a group training system 734. The group feature extraction system 710 is connected to the working group database 722 via software within a computer system. The group feature extraction system 710 is connected to the group identification system 730 via software within a computer system. The group identification system 730 is connected to the working group database 722 via software within a computer system. The group identification system 730 is connected to the trained group database 732 via software within a computer system. The group training system 734 is connected to the working group database 722 via software within a computer system. The group training system 734 is connected to the trained group database 732 via software within a computer system.

System 710 is a group feature extraction system. Group feature extraction system 710 receives document information including the class identifier and text data for each page. System 710 identifies data features that potentially indicate that a page belongs to a document set. The preferred implementation identifies page numbers and account numbers.

System 722 is a working group database. Working group database 722 is used to support both the processing of jobs and the group training system 734. Working group database 722 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the working group database 722 uses a relational database to store pointers to the information pertinent to processing the job.

System 730 is a group identification system. Group identification system 730 utilizes the class identifier, the page numbers and the account numbers extracted by system 710 to group pages of a job that belong together. The preferred implementation uses an iterative grouping process that begins by assuming that all pages belong to independent groups. At each iteration step, the process attempts to merge existing groups using a merging confidence. The process terminates when group membership converges and there is no further change to the set of groups.

The group identification system 730 uses a merging confidence that is determined from matching and mismatching criteria that are stored in the trained group database 732. Matching criteria between two groups contribute towards an increased confidence to merge the groups, while mismatching criteria contribute towards keeping the groups separate. The final merging confidence is used to decide whether to merge the two groups. This process is repeated for every pair of groups, in each iteration step of the process.
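The iterative grouping process may be sketched as follows. The scoring rule in `merging_confidence` (one point added per matching account number, one point subtracted per mismatching account number or class identifier), the threshold and the page records are all hypothetical stand-ins for the learned criteria held in the trained group database 732.

```python
from itertools import combinations


def merging_confidence(group_a, group_b):
    """Hypothetical score: matches raise it, mismatches lower it."""
    score = 0
    for a in group_a:
        for b in group_b:
            if a["account"] and a["account"] == b["account"]:
                score += 1          # matching criterion
            elif a["account"] and b["account"]:
                score -= 1          # mismatching account numbers
            if a["class_id"] != b["class_id"]:
                score -= 1          # mismatching class identifiers
    return score


def group_pages(pages, threshold=1):
    groups = [[page] for page in pages]     # every page starts in its own group
    changed = True
    while changed:                          # stop when membership converges
        changed = False
        for i, j in combinations(range(len(groups)), 2):
            if merging_confidence(groups[i], groups[j]) >= threshold:
                groups[i] = groups[i] + groups[j]
                del groups[j]
                changed = True
                break                       # indices are stale; rescan all pairs
    return groups


PAGES = [
    {"id": 1, "class_id": "1099", "account": "A1"},
    {"id": 2, "class_id": "1099", "account": "A1"},
    {"id": 3, "class_id": "W-2", "account": "B7"},
    {"id": 4, "class_id": "1099", "account": "A1"},
]
grouped_ids = [[page["id"] for page in group] for group in group_pages(PAGES)]
print(grouped_ids)
```

Each merge reduces the number of groups by one, so the loop is guaranteed to terminate; it ends on the first pass in which no pair of groups reaches the threshold.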

System 732 is a trained group database. Trained group database 732 is used to support both the processing of jobs and the group training system 734. Trained group database 732 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained group database 732 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more document group data, the trained group database 732 grows. As the machine learning system sees more data, its group identification accuracy increases.

System 734 is a group training system. The group training system 734 extracts matching criteria from a large set of correctly grouped documents and adapts to a growing set of document data. Preferred implementations of the group training system 734 include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.

FIG. 8 is a data extraction system 452 according to a preferred embodiment of the invention. System 452 has a data feature extraction system 810, a working data database 822, a data identification system 830, a trained data database 832 and a data training system 834. The data feature extraction system 810 is connected to the working data database 822 via software within a computer system. The data feature extraction system 810 is connected to the data identification system 830 via software within a computer system. The data identification system 830 is connected to the working data database 822 via software within a computer system. The data identification system 830 is connected to the trained data database 832 via software within a computer system. The data training system 834 is connected to the working data database 822 via software within a computer system. The data training system 834 is connected to the trained data database 832 via software within a computer system.

System 810 is a data feature extraction system. The data feature extraction system 810 constructs an Image Form Model, which is a working representation of the layout of the confetti and text in the document image. The data feature extraction system 810 identifies layout features that potentially carry data. The preferred implementation identifies boxes (illustrated in FIG. 19), check boxes (illustrated in FIG. 20), text, lines and tables. The Image Form Model also contains references to the image features like lines and points that have been identified earlier.

The data feature extraction system 810 identifies canonical labels that occur in an image by searching through the extracted text data for corresponding expected labels. In order to be robust to OCR errors, data feature extraction system 810 utilizes inexact string matching algorithms that use Levenshtein distance to identify expected labels. An iterative technique that uses increasingly inexact string comparison on an increasingly narrower search space is utilized. If certain canonical labels are still not found because of severe OCR errors, image identification system 530 is used to find canonical labels using contour matching. The success of this technique is enhanced by the narrowed search for the corresponding missing expected labels.

The data feature extraction system 810 identifies data-containing features including boxes (real and virtual), check boxes, label-value pairs, and tables. The data feature extraction system 810 also identifies formatted data that are often not associated with a label, e.g. address blocks (illustrated in FIG. 21), phone numbers and account numbers. The data feature extraction system 810 also identifies regions of text that are not associated with any data, such as disclaimers and other text blocks that contain instructions for the reader rather than extractable data (referred to as instruction blocks and illustrated in FIG. 22).

System 822 is a working data database. Working data database 822 is used to support both the processing of jobs and the data training system 834. Working data database 822 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the working data database 822 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.

The working data database 822 consists of a flexible data structure that stores all of the features that the data feature extraction system 810 identifies along with the spatial relationships between them. The most primitive element of the data-structure is a Feature data-structure, which is a recursive data-structure that contains a set of Features. A Feature also maintains references to nearby Features; in the preferred implementation, these are four sets that correspond to references to Features above, below, to the left, and to the right of the Feature. A Feature provides iterators to traverse the five sets associated with it (the contained set and the four neighbor sets). A Feature also provides the ability to tag on a confidence metric. In the preferred implementation, the confidence is an integer in the range [0-100]. It is assigned by the algorithms that create the Feature, and is used as an estimate of the accuracy of the extracted data.

The primitive Feature data-structure is sub-classed into specific features. At the lowest level are the primitive features confetti, word, line, point. At the next level are label and value. Finally, there are features corresponding to each of the data-containing features, box, check-box, label-value pair, and table. There also are features corresponding to the elements of certain composite features like table headers, table rows, and table columns. There are also features corresponding to form-specific items such as address blocks, phone numbers, and instruction blocks.

The Feature data-structure supports operations to merge a set of features into another. For example, a label feature and a value feature that correspond to each other are merged into a Label-value pair feature. A set of value features that have been identified as a row of a table are merged into a row feature. A set of labels that have been identified as a table header are merged into a table header feature. In each of these cases, the set of features that were merged into the result are all contained within. They are accessed by enumerating the contained features. As with any feature, the respective algorithm can assign a confidence to the merged feature.
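The Feature data-structure and its merge operation may be sketched as follows. The field names, the use of a single class with a `kind` tag instead of sub-classes, and the choice of the minimum confidence for the merged feature are assumptions made for brevity.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass(eq=False)  # identity comparison; neighbor links may form cycles
class Feature:
    kind: str                      # e.g. "label", "value", "label-value pair"
    text: str = ""
    confidence: int = 0            # integer in the range [0, 100]
    contained: List["Feature"] = field(default_factory=list)
    above: List["Feature"] = field(default_factory=list)
    below: List["Feature"] = field(default_factory=list)
    left: List["Feature"] = field(default_factory=list)
    right: List["Feature"] = field(default_factory=list)

    def __post_init__(self):
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be in [0, 100]")

    def __iter__(self) -> Iterator["Feature"]:
        """Enumerate the features contained in this feature."""
        return iter(self.contained)


def merge(kind: str, parts: List[Feature], confidence: int) -> Feature:
    """Merge a set of features into a new feature that contains them all."""
    return Feature(kind=kind, confidence=confidence, contained=list(parts))


label = Feature("label", "Wages, tips, other comp.", confidence=90)
value = Feature("value", "52,000.00", confidence=80)
label.right.append(value)   # the value lies to the right of the label
value.left.append(label)
pair = merge("label-value pair", [label, value],
             confidence=min(label.confidence, value.confidence))
print([part.text for part in pair])
```

The merged features remain reachable by iterating over the result, which mirrors the statement that merged features are accessed by enumerating the contained features.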

System 830 is a data identification system. Data identification system 830 utilizes the Image Form Model created by system 810 to search for correlations between labels and values. The preferred implementation uses the classification of a particular page to determine the expected labels. The expected label set is a subset of the universe of labels, which is available in the trained data database 832. System 830 uses the expected label set to search for data in the image form model for the image. The layout features that have been identified in System 810 are used to aid the process of correlating labels with data.

The data identification system 830 performs relative location matching by comparing the locations of the identified confetti images with locations of unidentified confetti images, both stored in the working data database 822. FIG. 17 illustrates relative matching of labels.

The preferred implementation of data identification system 830 includes the ability to handle errors and noise. In some situations, poor image quality results in certain expected labels being missing. Data identification system 830 uses relative location matching by comparing the relative location of identified labels and unidentified text in the image form model with learned data in the trained data database 832.

Some images include multiple copies of form data. For example, in the image of a Form W-2 shown in FIG. 24, the data to be extracted is repeated four times. FIG. 50 illustrates four “Wages, tips, other comp.” boxes that appear on a W-2 form; FIG. 51 shows the corresponding data record. The data identification system 830 improves the accuracy of data extraction by utilizing each copy of data on an image with the following process for extracting data from multi-copy forms:

-   1. Identify if the image is a multi-copy form.
-   2. Extract data as with a single-copy form to get sets of canonical label-value pairs.
-   3. Group the extracted data into records corresponding to the layout of the multiple copies in the image:
    -   a) Count the number of occurrences of each canonical label extracted in step 2.
    -   b) The maximum number of occurrences m determined in step 3a is the number of records.
    -   c) Create the m records.
    -   d) Seed each record with the corresponding canonical label-value pairs that determined the number of records.
    -   e) Set the boundary of each of the records to be the rectilinear convex hull of the canonical label-value pair that seeded it.
    -   f) Add the remaining extracted canonical label-value pairs to the records:
        -   (i) Sort all canonical label-value pairs extracted in raster order.
        -   (ii) For each canonical label-value pair not yet added to a record:
            -   If the canonical label-value pair is enclosed by the record, add the canonical label-value pair to the record and continue.
            -   Otherwise, add the canonical label-value pair to a nearby record and extend the boundary of the record to be the rectilinear convex hull of the record and the new canonical label-value pair.
            -   If the resulting boundary intersects with another record, backtrack and then add the canonical label-value pair to the next record.

After the above process, data identification system 830 organizes the data extracted from such multi-form images into a set of m records as indicated by the layout. Accuracy of extracted data is improved by using a voting strategy to determine which of the m extracted data to select. In addition, if all extracted data instances are identical, then the extracted data is considered to be correct with high confidence. Conversely, if extracted data instances are different, then the extracted data is flagged.
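The record count of steps 3a and 3b and the voting strategy may be sketched as follows. Majority voting is assumed as the voting strategy, and the label names and values are hypothetical; the geometric assignment of label-value pairs to records (steps 3d through 3f) is not shown.

```python
from collections import Counter


def count_records(pairs):
    """Steps 3a-3b: m is the largest number of occurrences of any label."""
    return max(Counter(label for label, _ in pairs).values())


def vote(values):
    """Return (most common value, whether all copies agreed)."""
    counts = Counter(values)
    winner, _ = counts.most_common(1)[0]
    return winner, len(counts) == 1


def consolidate(records):
    """Select one value per canonical label and flag any disagreement."""
    labels = {label for record in records for label in record}
    result = {}
    for label in sorted(labels):
        values = [record[label] for record in records if label in record]
        value, unanimous = vote(values)
        result[label] = {"value": value, "flagged": not unanimous}
    return result


# Hypothetical extraction from a four-copy W-2 image; one copy has an OCR error.
RECORDS = [
    {"wages": "52000.00", "ssn": "123-45-6789"},
    {"wages": "52000.00", "ssn": "123-45-6789"},
    {"wages": "5Z000.00", "ssn": "123-45-6789"},
    {"wages": "52000.00", "ssn": "123-45-6789"},
]
PAIRS = [(label, value) for record in RECORDS for label, value in record.items()]
consolidated = consolidate(RECORDS)
print(count_records(PAIRS), consolidated)
```

Here the "ssn" values are identical across all copies and are accepted with high confidence, while the "wages" values differ, so the majority value is selected and the field is flagged.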

The data identification system 830 extracts data from tables(illustrated in FIG. 23) using a layout based strategy. The strategyaddresses the following problems with extracting data from tables.

-   -   1. Table headers often have poor OCR relative to the actual data        in the tables. This means that it is often the case that the        values can be correctly determined by the machine, but the        corresponding label cannot.    -   2. Table formats change.    -   3. A single table row can span multiple text lines in the image.        Conventional approaches to extract tables do not handle such        wrapped tables in a robust manner.    -   4. Tables are often interspersed with instruction blocks,        aggregate rows, incomplete rows, and overlapping columns.    -   5. An image with localized noise can still contain large amounts        of extractable table data.

The process for extracting tables is given below:

-   -   1. Start with the image as confetti (FIG. 25-A).    -   2. Find neatly aligned columns, which are a set of vertically        aligned confetti    -   3. Identify labels in the image, then consume all confetti        within the label area (FIG. 25-B).    -   4. Identify potential table headers by grouping labels        horizontally, consume all confetti in the header area (FIG.        25-C).    -   5. Remove instruction blocks from consideration. These areas do        not correspond to any extractable data and are identified using        heuristics associated with text density, font size and font type        (FIG. 27).    -   6. Remove noisy confetti (confetti with poor OCR, overlaid text,        and other situations where the data is bad or does not exist)        (FIG. 28).    -   7. Form horizontal projections of the remaining confetti. Use        the gaps in the projection data to identify rows (FIG. 29).    -   8. Collect rows that are in close proximity as candidate table        formations.    -   9. Grow columns within each candidate table formation using gaps        in their vertical projections until an obstruction is hit. Break        the table formation at that point (FIG. 30).    -   10. Associate the table formation with a header if possible        (FIG. 31).    -   11. Associate each column with a label.    -   12. Identify missing labels by matching the pattern of labels to        those in the trained data database 832 (FIG. 32).

The data identification system 830 handles wrapped columns as a special case. In step 8 above, if tables break repeatedly at a row count of one, then the rows are partitioned into two sets, the odds and evens. Now steps 7 through 11 operate on each of the two sets to get two interleaved tables. These two interleaved tables are merged to form the extracted table.
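The odd/even partitioning can be sketched as follows. This is only an illustration: it assumes each detected row is already a list of cell strings and that "merging" means concatenating the cells of each first line with those of the line wrapped beneath it. The function name `merge_wrapped_rows` is hypothetical.

```python
from itertools import zip_longest
from typing import List


def merge_wrapped_rows(rows: List[List[str]]) -> List[List[str]]:
    """Merge text lines that alternate between two interleaved tables."""
    first_lines = rows[0::2]   # 1st, 3rd, 5th ... text lines
    second_lines = rows[1::2]  # 2nd, 4th, 6th ... text lines
    merged = []
    for first, second in zip_longest(first_lines, second_lines, fillvalue=[]):
        merged.append(list(first) + list(second))
    return merged
```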

System 832 is a trained data database. Trained data database 832 is used to support both the processing of jobs and the data training system 834. Trained data database 832 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained data database 832 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more document data, the trained data database 832 grows. As the machine learning system sees more data, its data identification accuracy increases.

The trained data database 832 contains information that is used to extract data. The trained data database 832 includes:

1. For each type of form, a set of canonical labels associated with each data element that should be extracted from that type of form. Examples of canonical labels for a Form W-2 include the Social Security Number, Taxpayer Name, Wages, and Federal Income Tax Withheld.
2. For each canonical label, a set of expected labels that correspond to learned variations of the canonical label. Examples of variations in expected labels for the Social Security Number canonical label are Social Security Number, Soc. Security No. and SSN.
3. For each type of form, the learned variations in the relative locations of expected labels.
4. For each type of form, the learned variations in the types of data containing features that may occur. The data containing features include boxes, virtual boxes, check boxes, text, lines, and tables.
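Items 1 and 2 can be pictured as a nested mapping from form type to canonical label to expected label variations. The Python sketch below is an illustrative in-memory stand-in, not the storage layout of the trained data database 832; the names `TRAINED_LABELS`, `normalize` and `canonical_label` are hypothetical, and only the Social Security Number variations are taken from the text.

```python
from typing import Optional

# form type -> canonical label -> learned variations of that label
TRAINED_LABELS = {
    "W-2": {
        "Social Security Number": ["Social Security Number", "Soc. Security No.", "SSN"],
        "Taxpayer Name": ["Taxpayer Name"],
        "Wages": ["Wages"],
        "Federal Income Tax Withheld": ["Federal Income Tax Withheld"],
    },
}


def normalize(text: str) -> str:
    """Lower-case, drop periods and collapse whitespace before comparing."""
    return " ".join(text.lower().replace(".", " ").split())


def canonical_label(form_type: str, detected: str) -> Optional[str]:
    """Map a detected label to its canonical label, or None if unknown."""
    wanted = normalize(detected)
    for canonical, variations in TRAINED_LABELS.get(form_type, {}).items():
        if any(normalize(v) == wanted for v in variations):
            return canonical
    return None
```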

System 834 is a data training system. The data training system 834 adapts to a growing set of document data; additional document data adds additional features that must be analyzed. Preferred implementations of the data training system 834 include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.

The invention extracts data from an image via a process of progressive refinement and reduced character set OCR (as illustrated in FIG. 33) in order to overcome the imperfections of OCR or low quality documents. The scanned image is processed by generic OCR which, in this example, produces errors in both the label portion and the value portion of the box. However, using standard techniques, the OCR output for the label portion is correctly identified as “Medicare Tax Withheld”. In this example, the value related to the identified label is known to be a monetary amount, so the part of the image that corresponds to the value is reprocessed by a restricted-character-set OCR. This OCR process is trained to identify only the characters possible in a monetary amount, i.e. the digits [0-9], and certain special characters [$,. ( )-]. The reduced search space greatly increases the accuracy of the restricted-character-set OCR output, and it produces the correct value of 131.52.
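The effect of restricting the character set can be shown with a small Python sketch. It assumes, purely for illustration, that a recognizer yields a scored list of candidate characters for each character position; real OCR engines expose this differently, the text describes a separately trained OCR process rather than a filter, and the candidate scores below are invented.

```python
from typing import List, Optional, Set, Tuple

# Characters possible in a monetary amount, per the description above.
MONETARY_CHARS = set("0123456789$,.()-")


def decode(candidates: List[List[Tuple[str, float]]],
           allowed: Optional[Set[str]] = None) -> str:
    """Pick the best-scoring candidate per position, optionally restricted."""
    out = []
    for options in candidates:
        usable = [(ch, score) for ch, score in options
                  if allowed is None or ch in allowed]
        # "?" marks a position with no candidate inside the allowed set.
        out.append(max(usable, key=lambda cs: cs[1])[0] if usable else "?")
    return "".join(out)


# Hypothetical per-position candidates for a faint image of "131.52".
value_candidates = [
    [("l", 0.50), ("1", 0.45)],
    [("3", 0.90)],
    [("I", 0.60), ("1", 0.40)],
    [(".", 0.80)],
    [("S", 0.55), ("5", 0.45)],
    [("2", 0.90)],
]
```

With these invented scores, unrestricted decoding picks the letters `l`, `I` and `S`, while decoding restricted to `MONETARY_CHARS` can only return digits and the listed special characters.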

The invention extracts data from an image via a process of progressive refinement that utilizes a reduced search space as more is learned about the form being extracted (as illustrated in FIG. 34). In the example shown in FIG. 34, poor OCR is used to identify the correct label. First, the OCR output is used to identify the class of the form, because the classification process is very robust to poor OCR. After the form has been determined to be, for example, a W-2, the label search is constrained to only the labels that are expected in W-2 forms. This greatly reduces the search space, and therefore increases the accuracy of extraction.
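A constrained label search of this kind can be sketched in Python with the standard-library `difflib` module. The label lists, the function name `match_label` and the 0.6 cutoff are illustrative assumptions; the text does not say which string-matching technique is actually used.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical, abbreviated sets of expected labels per form class.
EXPECTED_LABELS = {
    "W-2": [
        "Wages, tips and other compensation",
        "Federal income tax withheld",
        "Social security wages",
        "Social security tax withheld",
        "Medicare wages and tips",
        "Medicare tax withheld",
    ],
    "1099-G": ["PAYER'S name", "RECIPIENT'S identification number"],
}


def match_label(ocr_text: str, form_class: str, cutoff: float = 0.6) -> Optional[str]:
    """Match noisy OCR text only against labels expected for the form class."""
    by_lower = {label.lower(): label for label in EXPECTED_LABELS.get(form_class, [])}
    hits = get_close_matches(ocr_text.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None
```

Because the candidate list contains only the labels of one form class, a noisy string has far fewer near-misses to be confused with than it would against the labels of every known form.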

In general, as more information is known about a form, constraints are added to reduce the search space. This reduction in search space permits prior processes to be rerun, significantly improving the overall extraction accuracy.

The invention extracts data from an image via a process of progressive refinement that utilizes data external to the form image being extracted (as illustrated in FIG. 35). In the example shown in FIG. 35, data that was extracted from the 1099-OID form is used to extract data from the 1099-G form. The Recipient's identification number on the 1099-G form is light and washed out, and results in poor OCR output. In this example, the two forms are in the same job, and they both have the same Recipient's name (John Smith). The Recipient's identification number on the 1099-G form can be inferred to be 432-10-9876, the same as the Recipient's identification number on the 1099-OID form.
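The inference in this example can be sketched in Python as follows. The document representation (a dict of fields, each with a `value` and a `conf` score), the field names and the 0.9 threshold are all assumptions made for the sketch.

```python
from typing import Dict, List, Optional

Document = Dict[str, Dict[str, object]]  # field name -> {"value": ..., "conf": ...}


def infer_from_job(job: List[Document], target_index: int, field: str,
                   key_field: str = "recipient_name",
                   min_conf: float = 0.9) -> Optional[str]:
    """Borrow a field value from another document in the same job.

    A value is borrowed only if the other document has the same key field
    (here the recipient's name) and a confidently extracted target field.
    """
    key = job[target_index][key_field]["value"]
    for i, doc in enumerate(job):
        if i == target_index:
            continue
        other_key = doc.get(key_field)
        candidate = doc.get(field)
        if (other_key and candidate and other_key["value"] == key
                and candidate["conf"] >= min_conf):
            return str(candidate["value"])
    return None
```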

The invention extracts data from an image via a process of progressive refinement that utilizes data not extracted from any image (as illustrated in FIG. 36). In the example shown in FIG. 36, data that is available in a “pro forma” file is used to identify data on a form. The pro forma file contains taxpayer information from the previous year's tax return that has been quality checked, including the taxpayer name, taxpayer Social Security Number, spouse name, spouse Social Security Number, dependent names and Social Security Numbers, and other information about the tax forms included in the previous year's tax return. All this information is available to the data extraction process, and is assumed to be accurate. The pro forma external data enables the verification and correction of low-confidence OCR-extracted data.

The invention utilizes a set of known-value databases to augment the results of conventional data extraction methods such as OCR. The known-value databases are obtained from vendors or public sources; the known-value databases are also built from data extracted from forms that have been submitted by users of the data extraction system. Known-value databases, for example, contain information on employers, banks and financial institutions and their corresponding addresses and identification numbers. FIG. 37 shows a 1099-G form in which the payer's name is struck out, making it difficult to OCR correctly. As can be seen in FIG. 38, the payer's name has not been extracted because of the missing label. A known-value database of the issuers of 1099-G forms (which are the revenue departments of the 50 states) provides the payer's name by a simple lookup. This finding is verified by comparing the lookup results against the relevant OCR output.
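A lookup followed by verification might look like the Python sketch below. The text does not state which key is used for the lookup; keying on the payer's identification number is an assumption, and the database entry, the function name `lookup_payer` and the 0.7 similarity threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical known-value database: payer identification number -> payer name.
KNOWN_PAYERS = {
    "00-0000000": "State Department of Revenue",
}


def lookup_payer(payer_id: str, ocr_name: str, threshold: float = 0.7) -> Optional[str]:
    """Look up the payer's name, then verify it against the noisy OCR output."""
    known = KNOWN_PAYERS.get(payer_id)
    if known is None:
        return None
    similarity = SequenceMatcher(None, known.lower(), ocr_name.lower()).ratio()
    # Accept the known value only if the OCR text is plausibly the same name.
    return known if similarity >= threshold else None
```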

The invention utilizes known constraints between the semantics of extracted data elements to identify potentially incorrectly extracted data. The constraints are specified by subject matter experts (for example, bankers in the case of loan origination forms); the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system. For example, FIG. 39 is an image of a W-2 form with a faded digit in the value for box 1 “Wages, tips and other compensation.” As shown in FIG. 40, the extracted value corresponding to the “Wages, tips and other compensation” label is 060.83 (versus the correct value of 9060.83). The extracted value is flagged as incorrect when comparing it to the extracted value for Federal income tax withheld (106.11). The constraints for a W-2 form specify that Federal income tax withholdings cannot exceed total wages.
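The W-2 constraint in this example can be expressed as a short Python check. The field names follow the labels quoted above; the function name `flag_w2` and the dict representation of extracted values are assumptions made for the sketch.

```python
from decimal import Decimal
from typing import Dict, List

WAGES = "Wages, tips and other compensation"
WITHHELD = "Federal income tax withheld"


def flag_w2(fields: Dict[str, str]) -> List[str]:
    """Return the labels whose extracted values violate a known constraint."""
    flagged = []
    # Constraint: Federal income tax withholdings cannot exceed total wages.
    if Decimal(fields[WITHHELD]) > Decimal(fields[WAGES]):
        flagged.append(WAGES)
    return flagged
```

Applied to the values from FIG. 40, the withheld amount 106.11 exceeds the extracted wages 060.83, so the wages value is flagged.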

The invention utilizes known constraints between the semantics of extracted data elements to correct potentially incorrectly extracted data. The constraints are specified by subject matter experts (for example, Certified Public Accountants in the case of income tax forms); the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system. In the above example illustrated in FIG. 39 and FIG. 40, the constraints for a W-2 form specify that, for wages below a threshold amount, in most cases “Wages, tips and other compensation” is equal to “Social security wages” and “Medicare wages and tips.” In this example, the constraints indicate that when “Wages, tips and other compensation” is flagged as incorrect and differs by a single digit from “Social security wages,” then the value from “Social security wages” replaces the value of “Wages, tips and other compensation.”
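The correction rule can be sketched in Python. Here "differs by a single digit" is interpreted as exactly one character substituted, missing or extra, which covers the faded leading digit in the example; that interpretation and the function names are assumptions.

```python
from typing import Dict

WAGES_LABEL = "Wages, tips and other compensation"
SS_WAGES_LABEL = "Social security wages"


def differs_by_one_char(a: str, b: str) -> bool:
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substituted character
    return a[i:] == b[i + 1:]          # one character missing from a


def correct_wages(fields: Dict[str, str], wages_flagged: bool) -> Dict[str, str]:
    """Replace flagged wages with Social security wages when they nearly match."""
    corrected = dict(fields)
    if wages_flagged and differs_by_one_char(fields[WAGES_LABEL], fields[SS_WAGES_LABEL]):
        corrected[WAGES_LABEL] = fields[SS_WAGES_LABEL]
    return corrected
```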

The invention utilizes known constraints in the layout of data elements to narrow the search space and thereby more accurately extract data. The layout constraints are specified by technical experts; the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system. FIG. 41 illustrates the relationship of layout elements in a portion of a W-2 form. In FIG. 41, for example, the label “Social security wages” is to the left of the label “Social security tax withheld.” This layout relationship and others, specified by experts or determined by analysis, are used to infer missing labels and also identify spurious data such as pencil marks, tick marks and other noise.

The invention predicts occurrences of instruction blocks based on detected layout patterns from forms that have been submitted by users of the data extraction system. The invention eliminates such instruction blocks from further data extraction, thus simplifying the extraction process and thereby improving the accuracy of data extraction.

The invention detects tables using column layout and the expected header layout based on detected layout patterns from forms that have been submitted by users of the data extraction system. Known constraints, in the form of relationships between header elements, are used to predict headers when not correctly detected.

The layout of multiple occurrences of a particular extracted artifact, e.g. four occurrences of each expected data element in a W-2, is used to identify the four logical records in the W-2.

The mechanism that was used to identify a particular data artifact, e.g. label identified by correct OCR text vs. predicted label, is used to attach a confidence to the extracted data.

1. Infer labels
2. Identify instruction blocks, pencil marks, other “noise” etc. and eliminate from search space
3. Map canonical and detected labels
4. Detect tables
5. Record detection
6. Attach confidence to extracted data

The invention utilizes a layout data structure to extract data from form images. The use of a layout data structure is illustrated in the context of a portion of a W-2 form image shown in FIG. 41. First, the low-level layout graph of confetti is created; its internal representation is partially illustrated in FIG. 42. While the left, right, top, and bottom connection sets exactly map the layout, for brevity, only the right and down sets for each confetti are shown in FIG. 42. Second, labels are detected. Third, as illustrated in FIG. 43, the layout graph is modified by identifying the detected labels (shown as light grey blocks). Fourth, the label-value correlations are determined (shown by the dark grey blocks). Note that the illustration shows the right set of each of the features shown. Note also that the layout relations of the contained features do not cross out of the container; this aspect of the data structure significantly improves the efficiency of the data structure. Also shown are the down sets of each feature. The contained features can be seen to maintain layout relations within the container, leaving it to the container to maintain external layout relations.
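A layout graph with container-local relations can be sketched in Python as below. The class and attribute names (`Feature`, `right_set`, `down_set`, `children`) and the nearest-neighbor rule used to fill the sets are assumptions; the internal representation shown in FIG. 42 and FIG. 43 may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Feature:
    """A confetti, label, value or container with its layout connections."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int
    right_set: List["Feature"] = field(default_factory=list)
    down_set: List["Feature"] = field(default_factory=list)
    children: List["Feature"] = field(default_factory=list)


def _v_overlap(a: Feature, b: Feature) -> bool:
    return a.y0 < b.y1 and b.y0 < a.y1


def _h_overlap(a: Feature, b: Feature) -> bool:
    return a.x0 < b.x1 and b.x0 < a.x1


def link(siblings: List[Feature]) -> None:
    """Fill right/down sets among siblings only, then recurse into containers.

    Relations never cross a container boundary: contained features are linked
    to each other, and the container carries the external relations.
    """
    for a in siblings:
        rights = [b for b in siblings
                  if b is not a and b.x0 >= a.x1 and _v_overlap(a, b)]
        downs = [b for b in siblings
                 if b is not a and b.y0 >= a.y1 and _h_overlap(a, b)]
        if rights:
            nearest = min(b.x0 for b in rights)
            a.right_set = [b for b in rights if b.x0 == nearest]
        if downs:
            nearest = min(b.y0 for b in downs)
            a.down_set = [b for b in downs if b.y0 == nearest]
    for a in siblings:
        link(a.children)
```

Keeping each feature's sets limited to its siblings means that a search for the neighbor of a label only ever visits the features inside the same container.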

The invention extracts data from an image via a process of progressive refinement that utilizes contour matching (as described above). While contour matching on its own is of limited value over a large universe of labels, coupled with the progressive refinement technique, contour matching is robust. As an example, the labels from the 1099-OID form of FIG. 35 are shown in FIG. 44. Since there is significant similarity between the contours for “PAYER's federal identification number” and “RECIPIENT's federal identification number,” it is inappropriate to differentiate these two labels using their contours. However, differentiating “RECIPIENT's name” from “PAYER'S name, street address, city, state, ZIP code and telephone no” is appropriate. Accordingly, contour matching is used in those cases in which the set of options is small.

The invention utilizes contour matching along with text-based label matching as part of the progressive refinement process. Once the 1099-OID form in FIG. 35 is correctly classified, for example, the search space for labels is restricted to labels that occur in a 1099-OID. As part of the progressive refinement process, in this example, all the labels except “RECIPIENT's name” and “Original Issue discount for 2009” were identified by text-based matching. Contour matching is then used to distinguish between these two labels.

FIG. 13 is a system diagram of the service control manager 410. System 410 has a main thread 1301, task queues 1302, database client thread controllers 1303, task queues 1304, slave controllers 1305 and SCM queue 1306.

The main thread 1301 controls the primary state machine for all the jobs in the system.

Task queues 1302 provide message queues for database communication.

Database client thread controllers 1303 manage the database server interface.

Task queues 1304 provide message queues for communication with slave controllers.

Slave controllers 1305 manage various slave processes via the slave controller interface.

The SCM queue 1306 provides a mechanism for the various controllers to communicate with the main thread.

In the preferred implementation, various threads communicate with each other using message queues. Whenever a new document is received for processing, the main thread is notified and it requests the database client thread to retrieve the job for processing based on the states and the queue of other jobs in the system.

In the preferred implementation, once the job is loaded in memory, a finite state machine for that job is created and the job starts to be processed. The main thread puts the job on a particular task queue based on the state machine instructions. For example, if the job needs to be image processed, then the job will be placed on the image processing task queue. If the slave controller for the image processing slave finds an idle image processing slave process, then the job is picked up from that queue and given to the slave process for processing. Once the slave finishes performing its assigned task, it returns the job to the slave controller, which puts the job back on the SCM queue 1306. The main thread sequentially picks up the job from the SCM queue 1306 and decides on the next state of the job based on the finite state machine states. Once a job is completed, the finite state machine for the job is closed and the extracted document is returned to the content repository 322 and made available to the client's portal as a finished and processed document.
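The flow of a job between the main thread, the task queues and the SCM queue can be simulated with the single-threaded Python sketch below. It is a simplification: the state names after image processing are hypothetical, the state machine is a fixed linear sequence, and the slave processes are replaced by a loop that simply records which task was performed.

```python
from collections import deque
from typing import Dict, List

# Hypothetical linear state machine; only image processing is named in the text.
STATES = ["image_processing", "classification", "extraction", "done"]


def run_jobs(job_ids: List[str]) -> Dict[str, List[str]]:
    """Simulate the main thread, per-task queues and the SCM queue."""
    scm_queue = deque((job_id, None) for job_id in job_ids)  # (job, finished state)
    task_queues = {state: deque() for state in STATES[:-1]}
    history: Dict[str, List[str]] = {job_id: [] for job_id in job_ids}

    while scm_queue or any(task_queues.values()):
        # Main thread: pick up jobs sequentially and decide their next state.
        while scm_queue:
            job_id, finished = scm_queue.popleft()
            next_state = STATES[0] if finished is None else STATES[STATES.index(finished) + 1]
            if next_state != "done":
                task_queues[next_state].append(job_id)
        # Slave controllers: hand queued jobs to a slave, then return them.
        for state, queue in task_queues.items():
            while queue:
                job_id = queue.popleft()
                history[job_id].append(state)       # stands in for the real task
                scm_queue.append((job_id, state))   # job goes back on the SCM queue
    return history
```

In a multi-threaded implementation, thread-safe queues such as `queue.Queue` would take the place of the `deque` objects used here.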

Alternatively, it is possible for a single process to implement all the functionality of the slaves as outlined in the description of the preferred implementation. The ideas outlined for the preferred implementation are all valid for such an implementation.

FIG. 18 is a diagram that depicts the various components of a computerized document data extraction system, according to certain embodiments of the invention. An exemplary document data extraction system may include a host computer 1801 that contains volatile memory 1802, a persistent storage device such as a hard drive 1808, a processor 1803, and a network interface 1804. Using the network interface, the system computer can interact with databases 1805, 1806. Although FIG. 18 illustrates a system in which the system computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface. The programmatic processes may be executed on a single host, as shown in FIG. 18, or they may be distributed across multiple hosts.

The host computer shown in FIG. 18 may serve as a document data analysis system. The host computer receives electronic documents from multiple users. Workstations may be connected to a graphical display device 1807 and to input devices such as a mouse 1809 and a keyboard 1810. Alternately, the active user's workstation may comprise a handheld device.

In some embodiments, the flow charts included in this application describe the logical steps that are embodied as computer executable instructions that could be stored in computer readable medium, such as various memories and disks, that, when executed by a processor, such as a server or server cluster, cause the processor to perform the logical steps.

While text extraction and recognition may be performed with OCR and OCR-like techniques, it is not limited to such. Other techniques could be used, including image recognition-like techniques.

As described above, preferred embodiments extract image features from a document and use them to assist in classifying the document and extracting data from the document. These image features include inherent image features, e.g. lines, line crossings, etc., that are put in place by the document authors (or authors of an original source or blank document) to organize the document or the like. They were typically not included as a means of identifying the document, even though the inventors have discovered that they can be used as such, especially with the use of machine learning techniques.

While many applications can benefit from extracting both image and text features so that the extracted features may be used to classify documents and extract data from those documents, for some applications, image features alone may suffice. Specifically, some problem domains may have document categories where the inherent image features are sufficiently distinctive to classify a document and extract data with high enough confidence (even without processing text features).

Preferred embodiments of the invention may incorporate classification techniques described in the following patent applications, each of which is hereby incorporated by reference herein in its entirety:

U.S. Patent Application Publication No. 2009/0116736, entitled “Systems and Methods to Automatically Classify Electronic Documents Using Extracted Image and Text Features and Using a Machine Learning Subsystem;”

U.S. Patent Application Publication No. 2009/0116757, entitled “Systems and Methods for Classifying Electronic Documents by Extracting and Recognizing Text and Image Features Indicative of Document Categories;”

U.S. Patent Application Publication No. 2009/0116755, entitled “Systems and Methods for Enabling Manual Classification of Unrecognized Documents to Complete Workflow for Electronic Jobs and to Assist Machine Learning of a Recognition System Using Automatically Extracted Features of Unrecognized Documents;”

U.S. Patent Application Publication No. 2009/0116756, entitled “Systems and Methods for Training a Document Classification System Using Documents from a Plurality of Users;”

U.S. Patent Application Publication No. 2009/0116746, entitled “Systems and Methods for Parallel Processing of Document Recognition and Classification Using Extracted Image and Text Features;” and

U.S. Patent Application Publication No. 2009/0119296, entitled “Systems and Methods for Handling and Distinguishing Binarized, Background Artifacts in the Vicinity of Document Text and Image Features Indicative of a Document Category.”

Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention. Features of the disclosed embodiments can be combined and rearranged in various ways.

1. In a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to extract data from the electronic documents, a method of automatically extracting data from each received electronic document at least in part using data external to the electronic document but associated with the job containing the document, the method comprising: analyzing each electronic document in a job to automatically extract images and text features; and if any of the images and text features extracted from the electronic document is not recognized, using data external to said document but associated with said job to identify the unrecognized feature, wherein the external source may be one of at least one other document in the job and a database having known values associated with the job.